METHODS FOR ANALYSIS OF SPECTRAL DATA AND THEIR APPLICATIONS:
ATHEROSCLEROSIS/CORONARY HEART DISEASE 



RELATED APPLICATIONS 

This application is related to (and where permitted by law, claims priority to): 

(a) United Kingdom patent application GB 0109930.8 filed 23 April 2001;

(b) United Kingdom patent application GB 0117428.3 filed 17 July 2001;

(c) United States Provisional patent application USSN 60/307,015 filed 20 July 2001;
the contents of each of which are incorporated herein by reference in their entirety.

This application is one of five applications filed on even date naming the same applicant: 

(1) attorney reference number WJW/LP5995600 (PCT/GB02/ ); 

(2) attorney reference number WJW/LP5995618 (PCT/GB02/ );

(3) attorney reference number WJW/LP5995626 (PCT/GB02/ );

(4) attorney reference number WJW/LP5995634 (PCT/GB02/ ); 

(5) attorney reference number WJW/LP5995642 (PCT/GB02/ ); 

the contents of each of which are incorporated herein by reference in their entirety. 



TECHNICAL FIELD

This invention pertains generally to the field of metabonomics, and, more particularly, to 
chemometric methods for the analysis of chemical, biochemical, and biological data, for 
example, spectral data, for example, nuclear magnetic resonance (NMR) spectra, and 
their applications, including, e.g., classification, diagnosis, prognosis, etc., especially in
the context of atherosclerosis/coronary heart disease. 

BACKGROUND 



Throughout this specification, including the claims which follow, unless the context
requires otherwise, the word "comprise," and variations such as "comprises" and 
"comprising," will be understood to imply the inclusion of a stated integer or step or group 
of integers or steps but not the exclusion of any other integer or step or group of integers 
or steps. 



WO 02/086500 PCT/GB02/01854

It must be noted that, as used in the specification and the appended claims, the singular
forms "a," "an," and "the" include plural referents unless the context clearly dictates
otherwise.

Ranges are often expressed herein as from "about" one particular value, and/or to
"about" another particular value. When such a range is expressed, another embodiment
includes from the one particular value and/or to the other particular value. Similarly,
when values are expressed as approximations, by the use of the antecedent "about," it
will be understood that the particular value forms another embodiment.

Biosystems

Biosystems can conveniently be viewed at several levels of bio-molecular organisation
based on biochemistry, i.e., genetic and gene expression (genomic and transcriptomic),
protein and signalling (proteomic) and metabolic control and regulation (metabonomic).
There are also important cellular ionic regulation variations that relate to genetic,
proteomic and metabolic activities, and systematic studies on these even at the cellular
and sub-cellular level should also be investigated to complete the full description of the
bio-molecular organisation of a bio-system.

Significant progress has been made in developing methods to determine and quantify
the biochemical processes occurring in living systems. Such methods are valuable in
the diagnosis, prognosis and treatment of disease, the development of drugs, for
improving therapeutic regimes for current drugs, and the like.

Many diseases of the human or animal body (such as cancers, degenerative diseases,
autoimmune diseases and the like) have an underlying basis in alterations in the
expression of certain genes. The expressed gene products, proteins, mediate effects
such as abnormal cell growth, cell death or inflammation. Some of these effects are
caused directly by protein-protein interactions; others are caused by proteins acting on
small molecules (e.g., "second messengers") which trigger effects including further gene
expression.

Likewise, disease states caused by external agents such as viruses and bacteria
provoke a multitude of complex responses in the infected host.

In a similar manner, the treatment of disease through the administration of drugs can
result in a wide range of desired effects and unwanted side effects in a patient.

In recent years, it has been appreciated that the reaction of human and animal subjects
to disease and treatments for them can vary according to the genomic makeup of an
individual. This has led to the development of the field of "pharmacogenomics." A fuller
understanding of how an individual's own genome reacts to a particular disease and/or
drug treatment will allow the development of new therapies, as well as the refinement of
existing ones.

At the genetic level, methods for examining gene expression in response to these types
of events are often referred to as "genomic methods," and are concerned with the
detection and quantification of the expression of an organism's genes, collectively
referred to as its "genome," usually by detecting and/or quantifying genetic molecules,
such as DNA and RNA. Genomic studies often exploit proprietary "gene chips," which
are small disposable devices encoded with an array of genes that respond to extracted
mRNAs produced by cells (see, for example, Klenk et al., 1997). Many genes can be
placed on a chip array and patterns of gene expression, or changes therein, can be
monitored rapidly, although at some considerable cost.

However, the biological consequences of gene expression, or altered gene expression
following perturbation, are extremely complex. This has led to the development of
"proteomic methods" which are concerned with the semi-quantitative measurement of
the production of cellular proteins of an organism, collectively referred to as its
"proteome" (see, for example, Geisow, 1998). Proteomic measurements utilise a variety
of technologies, but all involve a protein separation method, e.g., 2D gel-electrophoresis,
allied to a chemical characterisation method, usually some form of mass spectrometry.

At present, genomic methods have a high associated operational cost and proteomic
methods require investment in expensive capital cost equipment and are labour
intensive, but both have the potential to be powerful tools for studying biological
response. The choice of method is still uncertain since careful studies have sometimes
shown a low correlation between the pattern of gene expression and the pattern of
protein expression, probably due to sampling for the two technologies at inappropriate
time points. See, e.g., Gygi et al., 1999. Even in combination, genomic and proteomic
methods still do not provide the range of information needed for understanding
integrated cellular function in a living system, since they do not take account of the
dynamic metabolic status of the whole organism.

For example, genomic and proteomic studies may implicate a particular gene or protein
in a disease or a xenobiotic response because the level of expression is altered, but the
change in gene or protein level may be transitory or may be counteracted downstream
and as a result there may be no effect at the cellular and/or biochemical level.
Conversely, sampling tissue for genomic and proteomic studies at inappropriate time
points may result in a relevant gene or protein being overlooked.

Gene-based prognosis has yet to become a clinical reality for any major prevalent 
disease, almost all of which have multi-gene modes of inheritance and significant 
environmental impact making it difficult to identify the gene panels responsible for 
susceptibility. 

While genomic and proteomic methods may be useful aids, for example, in drug
development, they do suffer from substantial limitations. For example, while genomic
and proteomic methods may ultimately give profound insights into toxicological
mechanisms and provide new surrogate biomarkers of disease, at present it is very
difficult to relate genomic and proteomic findings to classical cellular or biochemical
indices or endpoints. One simple reason for this is that with current technology and
approach, the correlation of the time-response to drug exposure is difficult. Further
difficulties arise with in vitro cell-based studies. These difficulties are particularly
important for the many known cases where the metabolism of the compound is a
prerequisite for a toxic effect and especially true where the target organ is not the site of
primary metabolism. This is particularly true for pro-drugs, where some aspect of in situ
chemical (e.g., enzymatic) modification is required for activity.

Metabonomics 

A new "metabonomic" approach has been developed which is aimed at augmenting and
complementing the information provided by genomics and proteomics. "Metabonomics"
is conventionally defined as "the quantitative measurement of the multiparametric
metabolic response of living systems to pathophysiological stimuli or genetic
modification" (see, for example, Nicholson et al., 1999). This concept has arisen
primarily from the application of ¹H NMR spectroscopy to study the metabolic
composition of biofluids, cells, and tissues and from studies utilising pattern recognition
(PR), expert systems and other chemoinformatic tools to interpret and classify complex
NMR-generated metabolic data sets. Metabonomic methods have the potential,
ultimately, to determine the entire dynamic metabolic make-up of an organism.

As outlined above, each level of bio-molecular organisation requires a series of analytical
bio-technologies appropriate to the recovery of the individual types of bio-molecular data.
Genomic, proteomic and metabonomic technologies by definition generate massive data
sets which require appropriate multi-variate statistical tools (chemometrics,
bio-informatics) for data mining and to extract useful biological information. These data
exploration tools also allow the inter-relationships between multivariate data sets from
the different technologies to be investigated; they facilitate dimension reduction and
extraction of latent properties and allow multidimensional visualization.

This leads to the concept of "bionomics", the quantitative measurement and
understanding of the integrated function (and dysfunction) of biological systems at all
major levels of bio-molecular organisation. In the study of altered gene expression
(known as transcriptomics), the variables are mRNA responses measured using gene
chips; in proteomics, protein synthesis and associated post-translational modifications are
typically measured using (mainly) gel-electrophoresis coupled to mass spectrometry. In
both cases, thousands of variables can be measured and related to biological end-points
using statistical methods. In metabolic (metabonomic) studies, only NMR (especially ¹H)
and mass spectrometry have been used to provide this level of data density on
bio-materials, although these data can be supplemented by conventional biochemical
assays.

For in vivo mammalian studies, the ability to perform metabonomic studies on biofluids
such as plasma, CSF and urine is very important because it gives integrated
systems-based information on the whole organism. Furthermore, in clinical settings, for the full
utilization of functional genomic knowledge in patient screening, diagnostics and
prognostics, it is much more practical and ethically acceptable to analyze biofluid
samples than to perform human tissue biopsies and measure gene responses.

A pathological condition or a xenobiotic may act at the pharmacological level only and
hence may not affect gene regulation or expression directly. Alternatively, significant
disease or toxicological effects may be completely unrelated to gene switching. For
example, exposure to ethanol in vivo may cause many changes in gene expression but
none of these events explains drunkenness. In cases such as these, genomic and
proteomic methods are likely to be ineffective. However, all disease or drug-induced
pathophysiological perturbations result in disturbances in the ratios and concentrations,
binding or fluxes of endogenous biochemicals, either by direct chemical reaction or by
binding to key enzymes or nucleic acids that control metabolism. If these disturbances
are of sufficient magnitude, effects will result which will affect the efficient functioning of
the whole organism. In body fluids, metabolites are in dynamic equilibrium with those
inside cells and tissues and, consequently, abnormal cellular processes in tissues of the
whole organism following a toxic insult or as a consequence of disease will be reflected
in altered biofluid compositions.

Fluids secreted, excreted, or otherwise derived from an organism ("biofluids") provide a
unique window into its biochemical status since the composition of a given biofluid is a
consequence of the function of the cells that are intimately concerned with the fluid's
manufacture and secretion. For example, the composition of a particular fluid (e.g.,
urine, blood plasma, milk, etc.) can carry biochemical information on details of organ
function (or dysfunction), for example, as a result of xenobiotics, disease, and/or genetic
modification. Similarly, the composition and condition of an organism's tissues are also
indicators of the organism's biochemical status.

In general, a xenobiotic is a substance (e.g., compound, composition) which is
administered to an organism, or to which the organism is exposed. In general,
xenobiotics are chemical, biochemical or biological species (e.g., compounds) which are
not normally present in that organism, or are normally present in that organism, but not
at the level obtained following administration/exposure. Examples of xenobiotics include
drugs, formulated medicines and their components (e.g., vaccines, immunological
stimulants, inert carrier vehicles), infectious agents, pesticides, herbicides, substances
present in foods (e.g., plant compounds administered to animals), and substances
present in the environment.

In general, a disease state pertains to a deviation from the normal healthy state of the
organism. Examples of disease states include, but are not limited to: bacterial, viral, and
parasitic infections; cancer in all its forms; degenerative diseases (e.g., arthritis, multiple
sclerosis); trauma (e.g., as a result of injury); organ failure (including diabetes);
cardiovascular disease (e.g., atherosclerosis, thrombosis); and, inherited diseases
caused by genetic composition (e.g., sickle-cell anaemia).

In general, a genetic modification pertains to alteration of the genetic composition of an
organism. Examples of genetic modifications include, but are not limited to: the
incorporation of a gene or genes into an organism from another species; increasing the
number of copies of an existing gene or genes in an organism; removal of a gene or
genes from an organism; and, rendering a gene or genes in an organism non-functional.

Biofluids often exhibit very subtle changes in metabolite profile in response to external
stimuli. This is because the body's cellular systems attempt to maintain homeostasis
(constancy of internal environment), for example, in the face of cytotoxic challenge. One
means of achieving this is to modulate the composition of biofluids. Hence, even when
cellular homeostasis is maintained, subtle responses to disease or toxicity are expressed
in altered biofluid composition. However, dietary, diurnal and hormonal variations may
also influence biofluid compositions, and it is clearly important to differentiate these
effects if correct biochemical inferences are to be drawn from their analysis.

Metabonomics offers a number of distinct advantages (over genomics and proteomics) in
a clinical setting: firstly, it can often be performed on standard preparations (e.g., of
serum, plasma, urine, etc.), circumventing the need for specialist preparations of cellular
RNA and protein required for genomics and proteomics, respectively. Secondly, many of
the risk factors already identified (e.g., levels of various lipids in blood) are small
molecule metabolites which will contribute to the metabonomic dataset.

Application of NMR to Metabonomics 

One of the most successful approaches to biofluid analysis has been the use of NMR
spectroscopy (see, for example, Nicholson et al., 1989); similarly, intact tissues have
been successfully analysed using magic-angle-spinning ¹H NMR spectroscopy (see, for
example, Moka et al., 1998; Tomlins et al., 1998).

The NMR spectrum of a biofluid provides a metabolic fingerprint or profile of the
organism from which the biofluid was obtained, and this metabolic fingerprint or profile is
characteristically changed by a disease, toxic process, or genetic modification. For
example, NMR spectra may be collected for various states of an organism (e.g.,
pre-dose and various times post-dose, for one or more xenobiotics, separately or in
combination; healthy (control) and diseased animal; unmodified (control) and genetically
modified animal).

For example, in the evaluation of undesired toxic side-effects of drugs, each compound
or class of compound produces characteristic changes in the concentrations and
patterns of endogenous metabolites in biofluids that provide information on the sites and
basic mechanisms of the toxic process. NMR analysis of biofluids has successfully
uncovered novel metabolic markers of organ-specific toxicity in the laboratory rat, and it
is in this "exploratory" role that NMR as an analytical biochemistry technique excels.
However, the biomarker information in NMR spectra of biofluids is very subtle, as
hundreds of compounds representing many pathways can often be measured
simultaneously, and it is this overall metabonomic response to toxic insult that so well
characterises the lesion.

Another important advantage of NMR-based metabonomics over genomics or
proteomics is the intrinsic analytical accuracy of NMR spectroscopy. Reanalysis of the
same sample by ¹H NMR spectroscopy results in a typical coefficient of variation for the
measurement of peak intensities in a spectrum of less than 5% across the whole range
of peaks. Thus, if the appropriate experiments are undertaken, on average the value of
each peak intensity will lie in the range 0.95 to 1.05 of the true value. In addition, it is
possible using NMR spectroscopy to measure absolute amounts or concentrations of a
number of analytes, whereas using gene chip technology only fold changes can be
determined. The best available accuracy achieved using gene chips is a two-fold
change (i.e., the value for each parameter lies in the range 0.50 to 2.00 fold of the "true"
value), and proteomic technology is even less intrinsically accurate. A similar limitation
also applies to proteomic studies.
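
By way of illustration only, the coefficient of variation referred to above may be computed from repeated measurements of a single peak intensity as the sample standard deviation divided by the mean. The following Python sketch uses hypothetical replicate intensities (arbitrary units, not measured data), and the function name is an assumption introduced for this example.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical peak intensities from five re-analyses of one sample
# (illustrative numbers only, arbitrary units).
replicates = [100.0, 102.0, 98.0, 101.0, 99.0]

cv = coefficient_of_variation(replicates)
print(f"coefficient of variation = {cv:.2%}")
```

A value below 0.05 for such replicates would correspond to the "less than 5%" figure stated above.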

Although, undoubtedly, technology is improving at a rapid rate, the gap between the
intrinsic accuracies of NMR spectroscopy and gene chip technology is so wide that it will
require a revolutionary rather than evolutionary improvement in gene expression
quantification methodology before it can rival the accuracy of NMR spectroscopy.

The intrinsic accuracy of NMR provides a distinct advantage when applying pattern
recognition techniques. The multivariate nature of the NMR data means that
classification of samples is possible using a combination of descriptors even when one
descriptor is not sufficient, because of the inherently low analytical variation in the data.

All biological fluids and tissues have their own characteristic physico-chemical
properties, and these affect the types of NMR experiment that may be usefully
employed. One major advantage of using NMR spectroscopy to study complex
biomixtures is that measurements can often be made with minimal sample preparation
(usually with only the addition of 5-10% D2O) and a detailed analytical profile can be
obtained on the whole biological sample. Sample volumes are small, typically 0.3 to 0.5
mL for standard probes, and as low as 3 µL for microprobes. Acquisition of simple NMR
spectra is rapid and efficient using flow-injection technology. It is usually necessary to
suppress the water NMR resonance.

Many biofluids are not chemically stable and for this reason care should be taken in their
collection and storage. For example, cell lysis in erythrocytes can easily occur. If a
substantial amount of D2O has been added, then it is possible that certain ¹H NMR
resonances will be lost by H/D exchange. Freeze-drying of biofluid samples also causes
the loss of volatile components such as acetone. Biofluids are also very prone to
microbiological contamination, especially fluids, such as urine, which are difficult to
collect under sterile conditions. Many biofluids contain significant amounts of active
enzymes, either normally or due to a disease state or organ damage, and these
enzymes may alter the composition of the biofluid following sampling. Samples should
be stored deep frozen to minimise the effects of such contamination. Sodium azide is
usually added to urine at the collection point to act as an antimicrobial agent. Metal ions
and/or chelating agents (e.g., EDTA) may be added to bind to endogenous metal ions
(e.g., Ca²⁺, Mg²⁺ and Zn²⁺) and chelating agents (e.g., free amino acids, especially
glutamate, cysteine, histidine and aspartate; citrate) to intentionally alter and/or enhance
the NMR spectrum.

In all cases the analytical problem usually involves the detection of "trace" amounts of
analytes in a very complex matrix of potential interferences. It is, therefore, critical to
choose a suitable analytical technique for the particular class of analyte of interest in the
particular biomatrix which could be, for example, a biofluid or a tissue. High resolution
NMR spectroscopy (in particular ¹H NMR) appears to be particularly appropriate. The
main advantages of using ¹H NMR spectroscopy in this area are the speed of the
method (with spectra being obtained in 5 to 10 minutes), the requirement for minimal
sample preparation, and the fact that it provides a non-selective detector for all
metabolites in the biofluid regardless of their structural type, provided only that they are
present above the detection limit of the NMR experiment and that they contain
non-exchangeable hydrogen atoms. The speed advantage is of crucial importance in this
area of work as the clinical condition of a patient may require rapid diagnosis, and can
change very rapidly and so correspondingly rapid changes must be made to the therapy
provided.

NMR studies of body fluids should ideally be performed at the highest magnetic field
available to obtain maximal dispersion and sensitivity and most NMR studies have
been performed at 400 MHz or greater. With every new increase in available
spectrometer frequency the number of resonances that can be resolved in a biofluid
increases and although this has the effect of solving some assignment problems, it also
poses new ones. Furthermore, there are still important problems of spectral
interpretation that arise due to compartmentation and binding of small molecules in the
organised macromolecular domains that exist in some biofluids such as blood plasma
and bile. All this complexity need not reduce the diagnostic capabilities and potential of
the technique, but demonstrates the problems of biological variation and the influence of
variation on diagnostic certainty.

The information content of biofluid spectra is very high and the complete assignment of
the NMR spectrum of most biofluids is usually not possible (even using 900 MHz
NMR spectroscopy). However, the assignment problems vary considerably between
biofluid types. Some fluids have near constant composition and concentrations and in
these the majority of the NMR signals have been assigned. In contrast, urine
composition can be very variable and there is enormous variation in the concentration
range of NMR-detectable metabolites; consequently, complete analysis is much more
difficult. Those metabolites present close to the limits of detection for 1-dimensional (1D)
NMR spectroscopy (typically ca. 100 nM at 800 MHz) pose severe NMR spectral
assignment problems. (In absolute terms, the detection limit may be ca. 4 nmol, e.g., 1
µg of a 250 g/mol compound in a 0.5 mL sample volume.) Even at the present level of
technology in NMR, it is not yet possible to detect many important biochemical
substances (e.g., hormones, some proteins, nucleic acids) in body fluids because of
problems with sensitivity, line widths, dispersion and dynamic range and this area of
research will continue to be technology-limited. In addition, the collection of NMR
spectra of biofluids may be complicated by the relative water intensity, sample viscosity,
protein content, lipid content, and low molecular weight peak overlap.
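
By way of illustration only, the absolute amount quoted in the parenthetical example above can be checked by dividing a mass by a molar mass. The following Python sketch assumes a mass of 1 µg, which is the value consistent with the stated amount of ca. 4 nmol for a 250 g/mol compound; the function names are assumptions introduced for this example.

```python
def moles_from_mass(mass_g, molar_mass_g_per_mol):
    """Amount of substance (mol) from mass (g) and molar mass (g/mol)."""
    return mass_g / molar_mass_g_per_mol


def molar_concentration(amount_mol, volume_l):
    """Concentration (mol/L) of an amount dissolved in a given volume (L)."""
    return amount_mol / volume_l


# Assumed inputs: 1 µg of a 250 g/mol compound in a 0.5 mL sample volume.
amount_mol = moles_from_mass(1e-6, 250.0)
concentration_mol_per_l = molar_concentration(amount_mol, 0.5e-3)

print(f"amount = {amount_mol * 1e9:.1f} nmol")
print(f"concentration in 0.5 mL = {concentration_mol_per_l * 1e6:.1f} µM")
```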

Usually, in order to assign NMR spectra, comparison is made with spectra of authentic
materials and/or by standard addition of an authentic reference standard to the sample.
Additional confirmation of assignments is usually sought from the application of other
NMR methods, including, for example, 2-dimensional (2D) NMR methods, particularly
COSY (correlation spectroscopy), TOCSY (total correlation spectroscopy),
inverse-detected heteronuclear correlation methods such as HMBC (heteronuclear
multiple bond correlation), HSQC (heteronuclear single quantum coherence), and HMQC
(heteronuclear multiple quantum coherence), 2D J-resolved (JRES) methods, spin-echo
methods, relaxation editing, diffusion editing (e.g., both 1D NMR and 2D NMR such as
diffusion-edited TOCSY), and multiple quantum filtering. Detailed NMR spectroscopic
data for a wide range of metabolites and biomolecules found in biofluids have been
published (see, for example, Lindon et al., 1999) and supplementary information is
available in several literature compilations of data (see, for example, Fan, 1996; Sze et
al., 1994).

For example, the successful application of NMR spectroscopy of biofluids to study a
variety of metabolic diseases and toxic processes has now been well established and
many novel metabolic markers of organ-specific toxicity have been discovered (see, for
example, Nicholson et al., 1989; Lindon et al., 1999). For example, NMR spectra of
urine are identifiably altered in situations where damage has occurred to the kidney or
liver. It has been shown that specific and identifiable changes can be observed which
distinguish the organ that is the site of a toxic lesion. Also it is possible to focus in on
particular parts of an organ such as the cortex of the kidney and even in favourable
cases to very localised parts of the cortex.

It is also possible to deduce the biochemical mechanism of the xenobiotic toxicity, based
on a biochemical interpretation of the changes in the urine. A wide range of toxins has
now been investigated including mostly kidney toxins and liver toxins, but also testicular
toxins, mitochondrial toxins and muscle toxins.

Pattern Recognition

However, a limiting factor in understanding the biochemical information from both 1D and
2D-NMR spectra of tissues and biofluids is their complexity. The most efficient way to
investigate these complex multiparametric data is to employ the 1D and 2D NMR
metabonomic approach in combination with computer-based "pattern recognition" (PR)
methods and expert systems. These statistical tools are similar to those currently being
explored by workers in the fields of genomics and proteomics.

Pattern recognition (PR) methods can be used to reduce the complexity of data sets, to
generate scientific hypotheses and to test hypotheses. In general, the use of pattern
recognition algorithms allows the identification, and, with some methods, the
interpretation of some non-random behaviour in a complex system which can be
obscured by noise or random variations in the parameters defining the system. Also, the
number of parameters used can be very large such that visualisation of the regularities,
which for the human brain is best in no more than three dimensions, can be difficult.
Usually the number of measured descriptors is much greater than three and so simple
scatter plots cannot be used to visualise any similarity between samples. Pattern
recognition methods have been used widely to characterise many different types of
problem ranging, for example, over linguistics, fingerprinting, chemistry and psychology.
In the context of the methods described herein, pattern recognition is the use of
multivariate statistics, both parametric and non-parametric, to analyse spectroscopic
data, and hence to classify samples and to predict the value of some dependent variable
based on a range of observed measurements. There are two main approaches. One
set of methods is termed "unsupervised" and these simply reduce data complexity in a
rational way and also produce display plots which can be interpreted by the human eye.
The other approach is termed "supervised" whereby a training set of samples with known
class or outcome is used to produce a mathematical model and this is then evaluated
with independent validation data sets.

Unsupervised PR methods are used to analyse data without reference to any other
independent knowledge, for example, without regard to the identity or nature of a
xenobiotic or its mode of action. Examples of unsupervised pattern recognition methods
include principal component analysis (PCA), hierarchical cluster analysis (HCA), and
non-linear mapping (NLM).

One of the most useful and easily applied unsupervised PR techniques is principal
components analysis (PCA) (see, for example, Kowalski et al., 1986). Principal
components (PCs) are new variables created from linear combinations of the starting
variables with appropriate weighting coefficients. The properties of these PCs are such
that: (i) each PC is orthogonal to (uncorrelated with) all other PCs, and (ii) the first PC
contains the largest part of the variance of the data set (information content) with
subsequent PCs containing correspondingly smaller amounts of variance.

PCA, a dimension reduction technique, takes m objects or samples, each described by
values in K dimensions (descriptor vectors), and extracts a set of eigenvectors, which
are linear combinations of the descriptor vectors. The eigenvectors and eigenvalues are
obtained by diagonalisation of the covariance matrix of the data. The eigenvectors can
be thought of as a new set of orthogonal plotting axes, called principal components
(PCs). The extraction of the systematic variations in the data is accomplished by
projection and modelling of variance and covariance structure of the data matrix. The
primary axis is a single eigenvector describing the largest variation in the data, and is
termed principal component one (PC1). Subsequent PCs, ranked by decreasing
eigenvalue, describe successively less variability. The variation in the data that has not
been described by the PCs is called residual variance and signifies how well the model
fits the data. The projections of the descriptor vectors onto the PCs are defined as
scores, which reveal the relationships between the samples or objects. In a graphical
representation (a "scores plot" or eigenvector projection), objects or samples having
similar descriptor vectors will group together in clusters. Another graphical representation
is called a loadings plot, and this connects the PCs to the individual descriptor vectors,
and displays both the importance of each descriptor vector to the interpretation of a PC
and the relationship among descriptor vectors in that PC. In fact, a loading value is
simply the cosine of the angle which the original descriptor vector makes with the PC.
Descriptor vectors which fall close to the origin in this plot carry little information in the
PC, while descriptor vectors distant from the origin (high loading) are important in
interpretation.

Thus a plot of the first two or three PC scores gives the "best" representation, in terms of
information content, of the data set in two or three dimensions, respectively. A plot of the
first two principal component scores, PC1 and PC2, provides the maximum information
content of the data in two dimensions. Such PC maps can be used to visualise inherent
clustering behaviour, for example, for drugs and toxins based on similarity of their
metabonomic responses and hence mechanism of action. Of course, the clustering
information might be in lower PCs and these also have to be examined.
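
The PCA steps described above (diagonalisation of the covariance matrix, ranking by eigenvalue, projection to scores) can be sketched as follows. This is a minimal illustration assuming NumPy is available; the function name `pca` and the synthetic data are hypothetical and are not taken from the specification.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA by diagonalising the covariance matrix of mean-centred data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # mean-centre each descriptor
    cov = np.cov(Xc, rowvar=False)          # K x K covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # rank PCs by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, :n_components]    # unit vectors: direction cosines of each PC
    scores = Xc @ loadings                  # projections of the samples onto the PCs
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, loadings, explained

# Hypothetical data: 20 samples, 5 descriptors, two groups offset from each other
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[10:] += np.array([4.0, 4.0, 0.0, 0.0, 0.0])

scores, loadings, explained = pca(X)
residual = 1.0 - explained.sum()            # fraction of variance left unmodelled
```

Plotting the two columns of `scores` against each other gives a scores plot, and plotting the two columns of `loadings` gives the corresponding loadings plot.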

Hierarchical Cluster Analysis, another unsupervised pattern recognition method, permits
the grouping of data points which are similar by virtue of being "near" to one another in
some multidimensional space. Individual data points may be, for example, the signal
intensities for particular assigned peaks in an NMR spectrum. A "similarity matrix," S, is
constructed with elements $s_{ij} = 1 - r_{ij}/r_{ij}^{\max}$, where $r_{ij}$ is the interpoint distance between
points i and j (e.g., Euclidean interpoint distance), and $r_{ij}^{\max}$ is the largest interpoint
distance for all points. The most distant pair of points will have $s_{ij}$ equal to 0, since $r_{ij}$
then equals $r_{ij}^{\max}$. Conversely, the closest pair of points will have the largest $s_{ij}$. For two
identical points, $s_{ij}$ is 1.
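
As a small worked illustration of the similarity matrix defined above, the following sketch computes the similarity as one minus the ratio of each Euclidean interpoint distance to the largest interpoint distance. The function name and the four example points are hypothetical.

```python
import math

def similarity_matrix(points):
    """Return S with s_ij = 1 - r_ij / r_max for Euclidean distances r_ij."""
    n = len(points)
    r = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    r_max = max(max(row) for row in r)      # largest interpoint distance
    if r_max == 0:                          # all points identical
        return [[1.0] * n for _ in range(n)]
    return [[1.0 - r[i][j] / r_max for j in range(n)] for i in range(n)]

# Hypothetical points: 0 and 2 coincide, 0 and 3 are the most distant pair
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (6.0, 8.0)]
S = similarity_matrix(points)
```

Here the identical pair (points 0 and 2) has similarity 1, the most distant pair (points 0 and 3) has similarity 0, and the pair at half the maximum distance (points 0 and 1) has similarity 0.5.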

The similarity matrix is scanned for the closest pair of points. The pair of points are 
reported with their separation distance, and then the two points are deleted and replaced 
with a single combined point. The process is then repeated iteratively until only one 
point remains. A number of different methods may be used to determine how two 
clusters will be joined, including the nearest neighbour method (also known as the single 
link method), the furthest neighbour method, and the centroid method (including centroid 
link, incremental link, median link, group average link, and flexible link variations). 
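
The iterative joining procedure can be sketched as below. This is only an illustration, assuming the nearest neighbour (single link) rule from the list above; the other linkage rules would replace the `gap` function. All names and data are hypothetical.

```python
import math

def single_link(points):
    """Agglomerate points; return the merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]

    def gap(a, b):
        # single link: distance between the closest members of two clusters
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []
    while len(clusters) > 1:
        # scan all cluster pairs for the closest pair
        d, a, b = min(
            (gap(a, b), a, b)
            for k, a in enumerate(clusters)
            for b in clusters[k + 1:]
        )
        merges.append((a, b, d))            # report the pair and its separation
        # delete the two clusters and replace them with the combined cluster
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical one-dimensional data points
points = [(0.0,), (1.0,), (5.0,), (5.5,), (20.0,)]
merges = single_link(points)
```

The reported separation distances increase from merge to merge, which is the ordering that a dendrogram displays along its distance axis.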

The reported connectivities are then plotted as a dendrogram (a tree-like chart which
allows visualisation of clustering), showing sample-sample connectivities versus
increasing separation distance (or equivalently, versus decreasing similarity). The
dendrogram has the property that the branch lengths are proportional to the
distances between the various clusters, and hence the length of the branches linking one
sample to the next is a measure of their similarity. In this way, similar data points may
be identified algorithmically.

Non-linear mapping (NLM) is a simple concept which involves calculation of the
distances between all of the points in the original K dimensions. This is followed by
construction of a map of points in 2 or 3 dimensions where the sample points are placed
in random positions or at values determined by a prior principal components analysis.
The least squares criterion is used to move the sample points in the lower dimension
map to fit the inter-point distances in the lower dimension space to those in the K
dimensional space. Non-linear mapping is therefore an approximation to the true
inter-point distances, but points close in K-dimensional space should also be close in 2 or 3
dimensional space (see, for example, Brown et al., 1996; Farant et al., 1992).
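
A minimal sketch of the non-linear mapping idea follows, assuming NumPy and a plain gradient descent on an unweighted least squares misfit. The cited references may use a different weighting or optimiser; the function names, step size, and data are hypothetical.

```python
import numpy as np

def pairwise(X):
    """Euclidean distances between all pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(Y, D):
    """Least squares misfit between map distances and target distances D."""
    return ((pairwise(Y) - D) ** 2).sum() / 2.0   # each pair counted once

def nonlinear_map(X, n_dims=2, n_iter=300, step=0.01, seed=0):
    X = np.asarray(X, dtype=float)
    D = pairwise(X)                               # distances in the K dimensions
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(X.shape[0], n_dims))     # random starting positions
    history = [stress(Y, D)]
    for _ in range(n_iter):
        d = pairwise(Y)
        np.fill_diagonal(d, 1.0)                  # avoid 0/0 on the diagonal
        coeff = (d - D) / d
        np.fill_diagonal(coeff, 0.0)
        diff = Y[:, None, :] - Y[None, :, :]
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        Y = Y - step * grad                       # move points to reduce the misfit
        history.append(stress(Y, D))
    return Y, history

# Hypothetical data: 12 samples described in K = 5 dimensions
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 5))
Y, history = nonlinear_map(X)
```

The step size is kept small relative to the number of samples so that each move reduces the misfit; starting from PCA scores instead of random positions would only change how `Y` is initialised.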

In this simple metabonomic approach, a sample from an animal treated with a compound
of unknown toxicity is compared with a database of NMR-generated metabolic data from
control and toxin-treated animals. By observing its position on the PR map relative to
samples of known effect, the unknown toxin can often be classified. The same approach
can be used for human samples for classification according to disease. However, such
data are often more complex, with time-related biochemical changes detected by NMR.
Also, it is more rigorous to compare effects of xenobiotics in the original K-dimensional
NMR metabonomic space.

Alternatively, and in order to develop automatic classification methods, it has proved
efficient to use a "supervised" approach to NMR data analysis. Here, a "training set" of
NMR metabonomic data is used to construct a statistical model that predicts correctly the
"class" of each sample. The model is then tested with independent data (referred
to as a test or validation set) to determine the robustness of the computer-based model.
These models are sometimes termed "expert systems," but may be based on a range of
different mathematical procedures. Supervised methods can use a data set with
reduced dimensionality (for example, the first few principal components), but typically
use unreduced data, with all dimensionality. In all cases the methods allow the
quantitative description of the multivariate boundaries that characterise and separate
each class, for example, each class of xenobiotic in terms of its metabolic effects. It is
also possible to obtain confidence limits on any predictions, for example, a level of
probability to be placed on the goodness of fit (see, for example, Kowalski et al., 1986).
The robustness of the predictive models can also be checked using cross-validation, by
leaving out selected samples from the analysis.
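
The cross-validation check mentioned above can be illustrated with a leave-one-out loop. The sketch below uses a nearest-centroid classifier purely as a stand-in for whichever supervised method is chosen; the classifier, the class labels, and the data are assumptions made for illustration.

```python
import math

def nearest_centroid_fit(X, y):
    """Return {class label: centroid} for the training samples."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

def nearest_centroid_predict(model, row):
    """Assign the class whose centroid is closest to the sample."""
    return min(model, key=lambda label: math.dist(model[label], row))

def leave_one_out_accuracy(X, y):
    """Refit with each sample left out in turn and score its prediction."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(X_train, y_train)
        hits += nearest_centroid_predict(model, X[i]) == y[i]
    return hits / len(X)

# Hypothetical two-descriptor training set with two well separated classes
X = [[0.1, 0.2], [0.0, 0.4], [0.3, 0.1], [2.0, 2.1], [2.2, 1.9], [1.8, 2.3]]
y = ["control", "control", "control", "treated", "treated", "treated"]
accuracy = leave_one_out_accuracy(X, y)
```

A model that scores well only on its own training set but poorly in such a loop would be regarded as lacking robustness.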

Expert systems may operate to generate a variety of useful outputs, for example,
(i) classification of the sample as "normal" or "abnormal" (this is a useful tool in the
control of spectrometer automation, e.g., using sequential flow injection NMR
spectroscopy); (ii) classification of the target organ for toxicity and site of action within
the tissue, where in certain cases mechanism of toxic action may also be classified; and,
(iii) identification of the biomarkers of a pathological disease condition or toxic effect for
the particular compound under study. For example, a sample can be classified as
belonging to a single class of toxicity, to multiple classes of toxicity (more than one target
organ), or to no class. The latter case would indicate deviation from normality (control)
based on the training set model but having a dissimilar metabolic effect to any toxicity
class modelled in the training set (unknown toxicity type). Under (ii), a system could also
be generated to support decisions in clinical medicine (e.g., for efficacy of drugs) rather
than toxicity.



Examples of supervised pattern recognition methods include the following:

soft independent modelling of class analogy (SIMCA) (see, for example, Wold, 1976);

partial least squares analysis (PLS) (see, for example, Wold, 1966; Joreskog, 1982;
Frank, 1984; Bro, R., 1997);

linear discriminant analysis (LDA) (see, for example, Nillson, 1965);

K-nearest neighbour analysis (KNN) (see, for example, Brown et al., 1998);

artificial neural networks (ANN) (see, for example, Wasserman, 1989; Anker et
al., 1992; Hare, 1994);

probabilistic neural networks (PNNs) (see, for example, Parzen, 1962; Bishop,
1995; Speckt, 1990; Broomhead et al., 1988; Patterson, 1996);

rule induction (RI) (see, for example, Quinlan, 1986); and,

Bayesian methods (see, for example, Bretthorst, 1990a, 1990b, 1988).

As the size of metabonomic databases increases together with improvements in rapid
throughput of NMR samples (> 300 samples per day per spectrometer is now possible
with the first generation of flow injection systems), more subtle expert systems may be
necessary, for example, using techniques such as "fuzzy logic" which permit greater
flexibility in decision boundaries.



Application to Metabonomics 



Pattern recognition methods have been applied to the analysis of metabonomic data.
See, for example, Lindon et al., 2001. A number of spectroscopic techniques have been
used to generate the data, including NMR spectroscopy and mass spectrometry. Pattern
recognition analysis of such data sets has been successful in some cases. The
successful studies include, for example, complex NMR data from biofluids (see, for
example, Anthony et al., 1994; Anthony et al., 1995; Beckwith-Hall et al., 1998; Gartland
et al., 1990a; Gartland et al., 1990b; Gartland et al., 1991; Holmes et al., 1998a; Holmes
et al., 1998b; Holmes et al., 1992; Holmes et al., 1994; Spraul et al., 1994; Tranter et al.,
1999), conventional NMR spectra from tissue samples (Somorjai et al., 1995),
magic-angle-spinning (MAS) NMR spectra of tissues (Garrod et al., 2001), in vivo NMR spectra
(Morvan et al., 1990; Howells et al., 1993; Stoyanova et al., 1995; Kuesel et al., 1996;
Confort-Gouny et al., 1992; Weber et al., 1998), wines (Martin et al., 1998, 1999) and
plant tissues (Kopka et al., 2000).

Although the utility of the metabonomic approach is well established, its full potential has
not yet been exploited. The metabolic variation is often subtle, and powerful analysis
methods are required for detection of particular analytes, especially when the data (e.g.,
NMR spectra) are so complex. For example, all that has been previously proposed is
still not generally sufficient to achieve clinically useful diagnosis of disease. New
methods to extract useful metabolic information from biofluids are needed.

The inventors have developed novel methods (which employ multivariate statistical
analysis and pattern recognition (PR) techniques, and optionally data filtering
techniques) of analysing data (e.g., NMR spectra) from a test population which yield
accurate mathematical models which may subsequently be used to classify a test
sample or subject, and/or in diagnosis.

Unlike methods previously described, the methods described herein have the power to
provide clinically useful and accurate diagnostic and prognostic information in a medical
setting.

The methods described herein represent a significant advance over chemometric
methodologies described previously. Although chemometrics has been able to provide
some classification of types previously, the studies have required that the classification
be done under a series of restrictions which limit the ability to apply the method to
analysis of complex datasets as would be required to apply the method for the practical
diagnosis/prognosis of diseases that could be useful clinically.

For example, several studies have reported on the classification of animals on the basis
of an NMR spectrum of urine or plasma. Although these studies clearly demonstrate the
potential of the technique, they are limited because the animals which compose each
class are genetically homogenous (in-bred populations). As a result, these methods
have been demonstrated to be able to detect patterns but only against "low noise"
backgrounds. Application of metabonomics to "real" populations (e.g., in human clinical
practice) requires the ability to detect patterns against the substantial noise due to the
genetic variation of out-bred populations and also due to dietary and hormonal
differences.

Similarly, many of the studies described to date have examined relatively major
differences between groups, for example, the ability to differentiate renally acting toxins
from liver acting toxins. The two groups under study differed in a broad spectrum of
metabolites, making the pattern relatively easy to detect. In conjunction with the
restriction of using in-bred populations of animals, most studies published to date have
only demonstrated metabonomics to be practicable under conditions of high "signal to
noise" ratio, conditions which are very different from the human clinical environment.

Some studies have begun to attempt classifications of out-bred human populations
where the data variation is high. However, to date, all these studies have simplified the
system substantially to focus in on specific molecules: for example, some studies have
looked specifically at the resonances associated with lipoproteins. Since lipoproteins are
major constituents of plasma, the variance they contribute readily exceeds the
background variance due to genetic and environmental differences between individuals.
Unfortunately, such an approach is insufficiently powerful to identify weak patterns
against the background biochemical noise, and could not be used, for example, to
determine the extent of coronary heart disease or to distinguish identical from
non-identical twins. Identification of such low "signal to noise" ratio patterns requires the
application of the methods of this invention, which represent a significant advance over
what has been previously reported.



SUMMARY OF THE INVENTION

One aspect of the present invention pertains to a method of classifying a sample, as
described herein.

One aspect of the present invention pertains to a method of classifying a subject as
described herein.

One aspect of the present invention pertains to a method of diagnosing a subject as
described herein.

One aspect of the present invention pertains to a method of identifying a diagnostic
species, or a combination of a plurality of diagnostic species, for a predetermined
condition, as described herein.

One aspect of the present invention pertains to a diagnostic species identified by a
method as described herein.

One aspect of the present invention pertains to a diagnostic species identified by a
method as described herein, for use in a method of classification.

One aspect of the present invention pertains to a method of classification which employs
or relies upon one or more diagnostic species identified by a method as described herein.

One aspect of the present invention pertains to use of one or more diagnostic species
identified by a method of classification as described herein.

One aspect of the present invention pertains to an assay for use in a method of
classification, which assay relies upon one or more diagnostic species identified by a
method as described herein.

One aspect of the present invention pertains to use of an assay in a method of
classification, which assay relies upon one or more diagnostic species identified by a
method as described herein.

One aspect of the present invention pertains to a method of therapeutic monitoring of a
subject undergoing therapy which employs a method of classification as described
herein.

One aspect of the present invention pertains to a method of evaluating drug therapy
and/or drug efficacy which employs a method of classification, as described herein.

One aspect of the present invention pertains to a computer system or device, such as a
computer or linked computers, operatively configured to implement a method as
described herein; and related computer code, computer programs, data carriers carrying
such code and programs, and the like.

These and other aspects of the present invention are described herein.

As will be appreciated by one of skill in the art, features and preferred embodiments of
one aspect of the present invention will also pertain to other aspects of the present
invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1-CHD is a 600 MHz 1-D ¹H NMR spectrum for serum obtained from (A) a patient
with normal coronary arteries (NCA); and (B) a patient with triple vessel disease
(TVD). The spectra were recorded at a temperature of 300 K, corrected for phase and
baseline distortions, and chemical shifts were referenced to that of lactate (CH3; δ 1.33).

Figure 2A-CHD is a scores scatter plot for PC3 and PC2 (t3 vs. t2) for the principal
components analysis (PCA) model derived from 1-D NMR spectra from serum
samples from NCA (circles, •) and TVD (squares, ■) patients.

Figure 2B-CHD is the corresponding loadings scatter plot (p3 vs. p2) for the PCA shown
in Figure 2A-CHD.

Figure 2C-CHD is a scores scatter plot for PC2 and PC1 (t2 vs. t1) for the PCA model
derived from 1-D ¹H NMR spectra from serum samples from NCA (circles, •) and TVD
(squares, ■) patients. Prior to PCA, the data were filtered (in this case, using orthogonal
signal correction, OSC).

Figure 2D-CHD is the corresponding loadings scatter plot (p2 vs. p1) for the PCA shown
in Figure 2C-CHD.

Figure 2E-CHD is a scores scatter plot for PC2 and PC1 (t2 vs. t1) for the PLS-DA model
derived from 1-D ¹H NMR spectra from serum samples from NCA (circles, •) and TVD
(squares, ■) patients. Prior to PCA, the data were filtered (in this case, using orthogonal
signal correction, OSC).

Figure 2F-CHD is the corresponding loadings scatter plot (w*c2 vs. w*c1) for the PLS-DA
shown in Figure 2E-CHD.

Figure 3A-CHD shows a section of the variable importance plot (VIP) for the
OSC-PLS-DA model, showing the calculated importance of the 13 most important
variables.

Figure 3B-CHD is a plot of the regression coefficients of the 1-D NMR variables for
the TVD serum samples, derived from the OSC-PLS-DA. Each bar represents a spectral
region covering δ 0.04.

Figure 4-CHD is a y-predicted scatter plot, showing NCA (circles, •) and TVD (squares,
■) samples and validation samples (triangles, ▲, NCA or TVD as marked), for an
OSC-PLS-DA model.

Figure 5A-CHD is the scores scatter plot for PC2 and PC1 (t2 vs. t1) for the PCA model
calculated from 1-D ¹H NMR data for all three classes of serum sample: type "1" vessel
disease (triangles, ▲), type "2" vessel disease (circles, •), and type "3" vessel disease
(squares, ■).

Figure 5B-CHD is the corresponding loadings scatter plot (p2 vs. p1) for the PCA shown
in Figure 5A-CHD.

Figure 5C-CHD shows three pairs of plots (a scores scatter plot for PC2 and PC1
(t2 vs. t1) for a PLS-DA model calculated from 1-D NMR data for pairs of classes of
serum samples, and the corresponding w*c loadings plot (w*c2 vs. w*c1)). In the scores
plots, type "1" samples are denoted by triangles (▲); type "2" samples are denoted by
circles (•); and type "3" samples are denoted by squares (■).
Figure 5C-(1)-CHD: type "1" and "2" scores scatter plot.
Figure 5C-(2)-CHD: type "1" and "2" loadings w*c scatter plot.
Figure 5C-(3)-CHD: type "2" and "3" scores scatter plot.
Figure 5C-(4)-CHD: type "2" and "3" loadings w*c scatter plot.
Figure 5C-(5)-CHD: type "1" and "3" scores scatter plot.
Figure 5C-(6)-CHD: type "1" and "3" loadings w*c scatter plot.



Figure 6A-CHD is a scores scatter plot for PC2 and PC1 (t2 vs. t1) for a PCA
model calculated using filtered 1-D NMR data (in this case, filtered using orthogonal
signal correction, OSC), for all three classes of serum sample: type "1" vessel disease
(triangles, ▲); type "2" vessel disease (circles, •); and type "3" vessel disease (squares, ■).



Figure 6B-CHD is the corresponding loadings scatter plot (p2 vs. p1) for the PCA shown in
Figure 6A-CHD.

Figure 6C-CHD shows three pairs of plots (a scores scatter plot for PC2 and PC1
(t2 vs. t1) for a PLS-DA model calculated from 1-D NMR data for pairs of classes of
serum samples, following OSC, and the corresponding w*c loadings plot (w*c2 vs. w*c1)).
In the scores plots, type "1" samples are denoted by triangles (▲); type "2" samples are
denoted by circles (•); and type "3" samples are denoted by squares (■).
Figure 6C-(1)-CHD: type "1" and "2" scores scatter plot.
Figure 6C-(2)-CHD: type "1" and "2" loadings w*c scatter plot.
Figure 6C-(3)-CHD: type "2" and "3" scores scatter plot.
Figure 6C-(4)-CHD: type "2" and "3" loadings w*c scatter plot.
Figure 6C-(5)-CHD: type "1" and "3" scores scatter plot.
Figure 6C-(6)-CHD: type "1" and "3" loadings w*c scatter plot.

Figure 7-CHD shows, for each of the three models described in Figure 6C, both a
section of the variable importance plot (VIP) and a plot of the regression coefficients for
the respective OSC-PLS-DA model. Each bar represents a spectral region covering δ
0.04.
Figure 7-(1)-CHD: VIP for "1" and "2" vessel disease samples.
Figure 7-(2)-CHD: Regression coefficients, "1" with respect to "2" vessel disease.
Figure 7-(3)-CHD: VIP for "2" and "3" vessel disease samples.
Figure 7-(4)-CHD: Regression coefficients, "2" with respect to "3" vessel disease.
Figure 7-(5)-CHD: VIP for "1" and "3" vessel disease samples.
Figure 7-(6)-CHD: Regression coefficients, "1" with respect to "3" vessel disease.

Figure 8-CHD shows three y-predicted scatter plots, showing type "1" (triangles, ▲), type
"2" (circles, •), type "3" (squares, ■) and validation samples (diamonds), for PLS-DA
models calculated for the same data, following OSC.
Figure 8A-CHD: type "1" and "2".
Figure 8B-CHD: type "2" and "3".
Figure 8C-CHD: type "1" and "3".

Figure 9A-CHD is a scores scatter plot for PC2 and PC1 (t2 vs. t1) for a PCA model
calculated from established clinical parameters for subjects with type "1" (triangles, ▲),
type "2" (circles, •), type "3" (squares, ■) vessel disease.

Figure 9B-CHD is the corresponding loadings scatter plot (p2 vs. p1) for the PCA shown
in Figure 9A-CHD.



Figure 9C-CHD shows three pairs of plots (a scores scatter plot for PC2 and PC1
(t2 vs. t1) for a PLS-DA model calculated using established clinical parameters, and the
corresponding loadings w*c plot (w*c2 vs. w*c1)). In the scores plots, type "1" samples
are denoted by triangles (▲); type "2" samples are denoted by circles (•); and type "3"
samples are denoted by squares (■).
Figure 9C-(1)-CHD: type "1" and "2" scores scatter plot.
Figure 9C-(2)-CHD: type "1" and "2" loadings w*c scatter plot.
Figure 9C-(3)-CHD: type "2" and "3" scores scatter plot.
Figure 9C-(4)-CHD: type "2" and "3" loadings w*c scatter plot.
Figure 9C-(5)-CHD: type "1" and "3" scores scatter plot.
Figure 9C-(6)-CHD: type "1" and "3" loadings w*c scatter plot.

Figure 10-CHD shows, for each of the three models described in Figure 9C, both a
section of the variable importance plot (VIP) and a plot of the regression coefficients for
the respective OSC-PLS-DA models. Each bar represents a spectral region covering δ
0.04.
Figure 10-(1)-CHD: VIP for "1" and "2" vessel disease samples.
Figure 10-(2)-CHD: Regres. coefs., "1" with respect to "2" vessel disease.
Figure 10-(3)-CHD: VIP for "2" and "3" vessel disease samples.
Figure 10-(4)-CHD: Regres. coefs., "2" with respect to "3" vessel disease.
Figure 10-(5)-CHD: VIP for "1" and "3" vessel disease samples.
Figure 10-(6)-CHD: Regres. coefs., "1" with respect to "3" vessel disease.

DETAILED DESCRIPTION OF THE INVENTION

Introduction

The inventors have developed novel methods (which employ multivariate statistical
analysis and pattern recognition (PR) techniques, and optionally data filtering
techniques) of analysing data (e.g., NMR spectra) from a test population which yield
accurate mathematical models which may subsequently be used to classify a test
sample or subject, and/or in diagnosis.

An NMR spectrum provides a fingerprint or profile for the sample to which it pertains.
Such spectra represent a measure of all NMR detectable species present in the sample
(rather than a select few) and also, to some extent, interactions between these species.
As such, these spectra are characterised by a high data density which, heretofore, has
not been fully exploited.

The methods described herein facilitate the analysis of such spectra, and the
subsequent use of the results of that analysis to classify test spectra (and therefore the
associated samples and subjects, if applicable) according to one or more distinguishing
criteria, at a discrimination level never before achieved.

These methods find particular application in the field of medicine. For example, analysis
of NMR spectra for samples taken from a population characterised by a certain condition
yields a mathematical model which can be used to classify an NMR spectrum for a
sample from a test subject as positive (also having the condition) or negative (not having
the condition) with a high degree of confidence.

In effect, these methods facilitate the identification of the particular combination of
amounts of (e.g., endogenous) species which are invariably associated with the
presence of the condition. These combinations (patterns), which typically comprise
many (often small) uncorrelated variances which together are diagnostic, are encoded
within the high data density of the NMR spectra. The methods described herein permit
their identification and subsequent use for classification.

However, it must be stressed that metabonomic analysis based on NMR spectra is much
more powerful than simply using a high technology analytical tool (the NMR
spectrometer) to measure the levels of known metabolites. That is, the methods
described herein are distinct from methods which simply carry out multiple independent
measures of discrete chemical entities (e.g., LDL cholesterol concentration).

For example, considering the variance in NMR spectral intensity (total peak intensity) in
any particular defined chemical shift region (known as a bucket or bin), a part of that
variance may be associated with a given molecule (a biomarker), the level of which
varies consistently as a result of the condition under study. The remainder of the
variance may be due to differences in the levels of other molecules which give peaks in
that integral region but which are unrelated to the condition under study (e.g., individual
to individual differences such as dietary factors, age, gender, etc.).
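
To make the notion of a bucket concrete, the sketch below sums spectral intensities into fixed-width chemical shift regions. It is only an illustration: the function name, the bucket edges, and the six data points are hypothetical, and a real spectrum would contain many thousands of points.

```python
def bucket_spectrum(ppm, intensity, width=0.04, lo=0.0, hi=10.0):
    """Sum intensities into fixed-width chemical shift buckets over [lo, hi)."""
    n_buckets = round((hi - lo) / width)
    totals = [0.0] * n_buckets
    for shift, value in zip(ppm, intensity):
        if lo <= shift < hi:
            index = min(int((shift - lo) / width), n_buckets - 1)
            totals[index] += value
    # label each bucket by the chemical shift at its centre
    centres = [lo + (i + 0.5) * width for i in range(n_buckets)]
    return centres, totals

# Hypothetical points in the 1.20 to 1.32 ppm region, bucket width 0.04 ppm
ppm = [1.29, 1.30, 1.31, 1.25, 1.26, 1.21]
intensity = [2.0, 5.0, 3.0, 1.0, 4.0, 7.0]
centres, totals = bucket_spectrum(ppm, intensity, width=0.04, lo=1.20, hi=1.32)
```

Each total is one variable in the data matrix passed to pattern recognition; every molecule with a peak inside the bucket contributes to it, whether or not it is related to the condition under study.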

The methods described herein, which employ pattern recognition techniques, permit
identification of that NMR peak intensity which is related to the condition under study,
even though only a small part of the variance in a spectral region (bucket) may be
related to the condition under study. The identification power is enhanced by the
application of data filtering techniques (e.g., orthogonal signal correction, OSC) which
can lower the influence of buckets with variance unrelated to the condition of interest.
Actual identification of the molecular biomarkers contributing to significant buckets is
carried out by reexamination of the original NMR spectra by NMR experts, and could
involve additional NMR spectroscopic experiments such as 2-dimensional NMR
spectroscopy; separation of putative substances and their identification using
HPLC-NMR-MS; addition of authentic substance to the sample and re-measuring the
NMR spectrum, checking for coincidence of NMR peaks; etc.

For example, in NMR spectra of blood plasma, in the region around δ 1.2-1.3, a number
of peaks appear, all of which will contribute to the intensity in those buckets labelled
δ 1.30 (e.g., the chemical shift region δ 1.32-1.28), δ 1.26 (e.g., the region δ 1.28-1.24),
and δ 1.22 (e.g., the region δ 1.24-1.20). Given the bucket width of 0.04 ppm (i.e., 24 Hz
at 600 MHz), the wings of the Lorentzian lines of the NMR resonances will have
contributions in most or all of these buckets even though the peak maximum appears in
a single bucket. The two main broad NMR peak envelopes in this region of the spectrum
have been assigned to the long chain methylene groups of the fatty acyl chains of
lipoproteins, and in addition there are a number of small molecule metabolites which
have NMR resonances in this region, some of which have been assigned. See, e.g.,
Nicholson et al., 1995. These include the methyl resonances of lactate (a doublet at δ
1.33), threonine (a doublet at δ 1.32), fucose (a doublet at δ 1.31), in some cases
3-hydroxybutyrate (a doublet at δ 1.20) and part of the methylene resonance of
isoleucine (a multiplet at δ 1.28). The two overlapping lipoprotein peaks have been
assigned as mainly VLDL at δ 1.29 and mainly LDL at δ 1.25. However, both of these
signals are asymmetric in appearance and are comprised of a number of overlapping
resonances. By examination of the NMR spectra of individual lipoprotein fractions, it
has been possible to use mathematical deconvolution techniques to show that this
composite envelope in the δ 1.3-1.2 region is comprised of two bands from VLDL, 3
bands from LDL and 2 bands from HDL. See, e.g., M. Ala-Korpela, Progress in NMR
Spectroscopy, 27, 475-554 (1995). In fact, the inventors have shown that the variance
in the spectral intensity in the bucket at δ 1.30 is only weakly correlated with the LDL
level measured independently for a panel of 100 patients. The correlation coefficient (r)
between the level of LDL as measured by a conventional method and the bucket
intensity at δ 1.30 in the NMR spectra of the same samples is only 0.45. Therefore, the
changes in the concentration of LDL over the samples in this panel of 100 patients
account for only about 20% of the variance in this bucket intensity, since the fraction of
variance accounted for is r² (0.45² ≈ 0.20). Thus the variance in the intensity in the δ 1.30
bucket, over the sample population, contains much more information than solely the
variance in the LDL concentration. The methods of the present invention permit the
determination and exploitation of such additional, until now hidden, information.
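
The two pieces of arithmetic in the paragraph above can be checked with a short sketch. The helper names are hypothetical, and the paired values used to exercise the correlation function are made up for illustration; only r = 0.45, the 0.04 ppm bucket width, and the 600 MHz observation frequency come from the text.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def variance_explained(r):
    """Fraction of variance in one variable accounted for by a linear fit to the other."""
    return r ** 2

def ppm_to_hz(width_ppm, spectrometer_mhz):
    """Width in Hz of a chemical shift interval at a given observation frequency."""
    return width_ppm * spectrometer_mhz

explained = variance_explained(0.45)    # about 0.20, i.e. roughly 20% of the variance
bucket_hz = ppm_to_hz(0.04, 600.0)      # 0.04 ppm at 600 MHz is 24 Hz

# Made-up paired values, only to show how r would be obtained from measurements
r_demo = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

With r = 0.45, roughly 80% of the variance in the bucket intensity is left unaccounted for by the LDL level.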

Furthermore, the methods can be applied to achieve classification into multiple
categories on the basis of a single dataset (e.g., an NMR spectrum for a single sample).
Due to the very high data density of the input dataset, the analysis method can
separately (i.e., in parallel) or sequentially (i.e., in series) perform multiple classifications.
For example, a single blood sample could be used to determine (e.g., diagnose) the
presence or absence of several, or indeed, many, (e.g., unrelated) conditions or
diseases.

Thus, one aspect of the present invention pertains to improved methods for the analysis
of chemical, biochemical, and biological data, for example spectra, for example, nuclear
magnetic resonance (NMR) and other types of spectra.

These techniques have been applied to the analysis of blood serum in the context of
atherosclerosis/coronary heart disease. For example, the metabonomic analysis can
distinguish between individuals with and without atherosclerosis/coronary heart disease.
Novel diagnostic biomarkers for atherosclerosis/coronary heart disease have been
identified, and associated methods for diagnosis have been described.

Methods of Classifying, Diagnosing

One aspect of the present invention pertains to a method of classifying a sample, as 
described herein. 

One aspect of the present invention pertains to a method of classifying a subject by
classifying a sample from said subject, wherein said method of classifying a sample is as
described herein.

One aspect of the present invention pertains to a method of diagnosing a subject by
classifying a sample from said subject, wherein said method of classifying a sample is as
described herein.

Classifying a Sample: By NMR Spectral Intensity

One aspect of the present invention pertains to a method of classifying a sample, said
method comprising the step of relating NMR spectral intensity at one or more
predetermined diagnostic spectral windows for said sample with a predetermined
condition.

One aspect of the present invention pertains to a method of classifying a sample from a
subject, said method comprising the step of relating NMR spectral intensity at one or
more predetermined diagnostic spectral windows for said sample with a predetermined
condition of said subject.

One aspect of the present invention pertains to a method of classifying a sample, said
method comprising the step of relating NMR spectral intensity at one or more
predetermined diagnostic spectral windows for said sample with the presence or
absence of a predetermined condition.

One aspect of the present invention pertains to a method of classifying a sample from a
subject, said method comprising the step of relating NMR spectral intensity at one or
more predetermined diagnostic spectral windows for said sample with the presence or
absence of a predetermined condition of said subject.

One aspect of the present invention pertains to a method of classifying a sample, said
method comprising the step of relating a modulation of NMR spectral intensity, relative to
a control value, at one or more predetermined diagnostic spectral windows for said
sample with a predetermined condition.

One aspect of the present invention pertains to a method of classifying a sample from a
subject, said method comprising the step of relating a modulation of NMR spectral
intensity, relative to a control value, at one or more predetermined diagnostic spectral
windows for said sample with a predetermined condition of said subject.

One aspect of the present invention pertains to a method of classifying a sample, said
method comprising the step of relating a modulation of NMR spectral intensity, relative to
a control value, at one or more predetermined diagnostic spectral windows for said
sample with the presence or absence of a predetermined condition.

One aspect of the present invention pertains to a method of classifying a sample from a
subject, said method comprising the step of relating a modulation of NMR spectral
intensity, relative to a control value, at one or more predetermined diagnostic spectral
windows for said sample with the presence or absence of a predetermined condition of
said subject.
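
As a purely illustrative sketch of "relating a modulation of NMR spectral intensity, relative to a control value" to the presence or absence of a condition, the following assumes that each diagnostic window has a known control intensity and a signed threshold. The window labels, intensities, thresholds, and the all-windows decision rule are hypothetical and are not taken from this specification.

```python
from typing import Dict

# Keys are window labels (chemical shift in ppm); all values are hypothetical.
Intensities = Dict[str, float]


def relative_modulation(sample: Intensities, control: Intensities) -> Dict[str, float]:
    """Fractional change in intensity at each window, relative to the control value."""
    return {w: (sample[w] - control[w]) / control[w] for w in control}


def relate_to_condition(modulation: Dict[str, float],
                        thresholds: Dict[str, float]) -> str:
    """Return 'presence' only if every window is modulated at least as far as its
    threshold, in the direction given by the threshold's sign; else 'absence'."""
    def reached(mod: float, thr: float) -> bool:
        return mod >= thr if thr >= 0 else mod <= thr

    hit = all(reached(modulation[w], thr) for w, thr in thresholds.items())
    return "presence" if hit else "absence"


control = {"1.30": 10.0, "0.86": 4.0}       # assumed control values
sample = {"1.30": 12.5, "0.86": 3.0}        # assumed intensities for the test sample
thresholds = {"1.30": 0.20, "0.86": -0.20}  # assumed: increase at 1.30, decrease at 0.86

modulation = relative_modulation(sample, control)
result = relate_to_condition(modulation, thresholds)
print(modulation, result)
```

A fixed per-window threshold is the simplest possible relating step; the predictive mathematical models described later in this specification take the place of such a rule.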

Classifying a Subject: By NMR Spectral Intensity

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of relating NMR spectral intensity at one or more
predetermined diagnostic spectral windows for a sample from said subject with a
predetermined condition of said subject.






One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of relating NMR spectral intensity at one or more
predetermined diagnostic spectral windows for a sample from said subject with the
presence or absence of a predetermined condition of said subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of relating a modulation of NMR spectral intensity, relative to
a control value, at one or more predetermined diagnostic spectral windows for a sample
from said subject with a predetermined condition of said subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of relating a modulation of NMR spectral intensity, relative to
a control value, at one or more predetermined diagnostic spectral windows for a sample
from said subject with the presence or absence of a predetermined condition of said
subject.

Diagnosing a Subject: By NMR Spectral Intensity

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of relating NMR spectral intensity
at one or more predetermined diagnostic spectral windows for a sample from said
subject with said predetermined condition of said subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of relating NMR spectral intensity
at one or more predetermined diagnostic spectral windows for a sample from said
subject with the presence or absence of said predetermined condition of said subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of relating a modulation of NMR
spectral intensity, relative to a control value, at one or more predetermined diagnostic
spectral windows for a sample from said subject with said predetermined condition of
said subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of relating a modulation of NMR
spectral intensity, relative to a control value, at one or more predetermined diagnostic
spectral windows for a sample from said subject with the presence or absence of said
predetermined condition of said subject.

Classifying a Sample: By Amount of Diagnostic Species

One aspect of the present invention pertains to a method of classifying a sample, said
method comprising the step of relating the amount of, or relative amount of, one or more
diagnostic species present in said sample with a predetermined condition.

One aspect of the present invention pertains to a method of classifying a sample from a
subject, said method comprising the step of relating the amount of, or relative amount of,
one or more diagnostic species present in said sample with a predetermined condition of
said subject.

One aspect of the present invention pertains to a method of classifying a sample, said
method comprising the step of relating the amount of, or relative amount of, one or more
diagnostic species present in said sample with the presence or absence of a
predetermined condition.

One aspect of the present invention pertains to a method of classifying a sample from a
subject, said method comprising the step of relating the amount of, or the relative
amount of, one or more diagnostic species present in said sample with the presence or
absence of a predetermined condition of said subject.

One aspect of the present invention pertains to a method of classifying a sample, said
method comprising the step of relating a modulation of the amount of, or relative amount
of, one or more diagnostic species present in said sample, as compared to a control
sample, with a predetermined condition.

One aspect of the present invention pertains to a method of classifying a sample from a
subject, said method comprising the step of relating a modulation of the amount of, or
relative amount of, one or more diagnostic species present in said sample, as compared
to a control sample, with a predetermined condition of said subject.

One aspect of the present invention pertains to a method of classifying a sample, said
method comprising the step of relating a modulation of the amount of, or relative amount
of, one or more diagnostic species present in said sample, as compared to a control
sample, with the presence or absence of a predetermined condition.

One aspect of the present invention pertains to a method of classifying a sample from a
subject, said method comprising the step of relating a modulation of the amount of, or
relative amount of, one or more diagnostic species present in said sample, as compared
to a control sample, with the presence or absence of a predetermined condition of said
subject.

Classifying a Subject: By Amount of Diagnostic Species

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of relating the amount of, or relative amount of, one or more
diagnostic species present in a sample from said subject with a predetermined condition
of said subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of relating the amount of, or relative amount of, one or more
diagnostic species present in a sample from said subject with the presence or absence
of a predetermined condition of said subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of relating a modulation of the amount of, or relative amount
of, one or more diagnostic species present in a sample from said subject, as compared to
a control sample, with a predetermined condition of said subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of relating a modulation of the amount of, or relative amount
of, one or more diagnostic species present in a sample from said subject, as compared to
a control sample, with the presence or absence of a predetermined condition of said
subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of relating the amount of, or
relative amount of, one or more diagnostic species present in a sample from said subject
with said predetermined condition of said subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of relating the amount of, or
relative amount of, one or more diagnostic species present in a sample from said subject
with the presence or absence of said predetermined condition of said subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of relating a modulation of the
amount of, or relative amount of, one or more diagnostic species present in a sample
from said subject, as compared to a control sample, with said predetermined condition of
said subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of relating a modulation of the
amount of, or relative amount of, one or more diagnostic species present in a sample
from said subject, as compared to a control sample, with the presence or absence of
said predetermined condition of said subject.

Classifying a Sample: By Mathematical Modelling

One aspect of the present invention pertains to a method of classification, said method
comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;

(b) using said model to classify a test sample.

One aspect of the present invention pertains to a method of classifying a test sample,
said method comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;
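wherein said modelling data comprises a plurality of data sets for modelling
samples of known class;

(b) using said model to classify said test sample as being a member of one of
said known classes.

One aspect of the present invention pertains to a method of classifying a test sample,
said method comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;

wherein said modelling data comprises at least one data set for each of a plurality
of modelling samples;

wherein said modelling samples define a class group consisting of a plurality of
classes;

wherein each of said modelling samples is of a known class selected from said
class group; and,

(b) using said model with a data set for said test sample to classify said test
sample as being a member of one class selected from said class group.

One aspect of the present invention pertains to a method of classification, said method
comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

to classify a test sample.

One aspect of the present invention pertains to a method of classifying a test sample,
said method comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

wherein said modelling data comprises a plurality of data sets for modelling
samples of known class;

to classify said test sample as being a member of one of said known classes.

One aspect of the present invention pertains to a method of classifying a test sample,
said method comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

wherein said modelling data comprises at least one data set for each of a plurality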
of modelling samples;

wherein said modelling samples define a class group consisting of a plurality of
classes;

wherein each of said modelling samples is of a known class selected from said
class group;

with a data set for said test sample to classify said test sample as being a
member of one class selected from said class group.
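
The two steps recited above, forming a model from data sets of known class and then using it to assign a test sample to one class of the class group, can be sketched as follows. A nearest-centroid rule is used here only as a simple stand-in and is not one of the modelling methods named in this specification (e.g., PCA, PLS-DA); the function names, class labels, and data values are hypothetical.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

DataSet = Sequence[float]  # e.g., bucket intensities for one sample


def form_model(modelling_data: List[Tuple[DataSet, str]]) -> Dict[str, List[float]]:
    """Step (a): compute one centroid per known class from the modelling data."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for data_set, known_class in modelling_data:
        if known_class not in sums:
            sums[known_class] = [0.0] * len(data_set)
            counts[known_class] = 0
        sums[known_class] = [s + x for s, x in zip(sums[known_class], data_set)]
        counts[known_class] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def use_model(model: Dict[str, List[float]], test_data_set: DataSet) -> str:
    """Step (b): assign the test sample to the class with the nearest centroid."""
    return min(model, key=lambda c: dist(model[c], test_data_set))


# Hypothetical modelling data: (data set, known class) for each modelling sample.
modelling_data = [
    ([1.0, 5.0], "presence"),
    ([1.2, 5.4], "presence"),
    ([3.0, 2.0], "absence"),
    ([3.2, 2.2], "absence"),
]
model = form_model(modelling_data)
print(use_model(model, [1.1, 5.1]))
```

The class group here consists of exactly two classes, so the classification step is equivalent to a presence/absence call for the test sample.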

Classifying a Subject: By Mathematical Modelling

One aspect of the present invention pertains to a method of classification, said method
comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;

(b) using said model to classify a subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;

wherein said modelling data comprises a plurality of data sets for modelling
samples of known class;

(b) using said model to classify a test sample from said subject as being a
member of one of said known classes, and thereby classify said subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;

wherein said modelling data comprises at least one data set for each of a plurality
of modelling samples;

wherein said modelling samples define a class group consisting of a plurality of
classes;

wherein each of said modelling samples is of a known class selected from said
class group; and,

(b) using said model with a data set for a test sample from said subject to classify
said test sample as being a member of one class selected from said class group, and
thereby classify said subject.

One aspect of the present invention pertains to a method of classification, said method
comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

to classify a subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

wherein said modelling data comprises a plurality of data sets for modelling
samples of known class;

to classify a test sample from said subject as being a member of one of said
known classes, and thereby classify said subject.

One aspect of the present invention pertains to a method of classifying a subject, said
method comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

wherein said modelling data comprises at least one data set for each of a plurality
of modelling samples;

wherein said modelling samples define a class group consisting of a plurality of
classes;

wherein each of said modelling samples is of a known class selected from said
class group;

with a data set for a test sample from said subject to classify said test sample as
being a member of one class selected from said class group, and thereby classify said
subject.

One aspect of the present invention pertains to a method of diagnosis, said method
comprising the steps of: 

(a) forming a predictive mathematical model by applying a modelling method to 
modelling data; 

(b) using said model to diagnose a subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;

wherein said modelling data comprises a plurality of data sets for modelling
samples of known class;

(b) using said model to classify a test sample from said subject as being a
member of one of said known classes, and thereby diagnose said subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;

wherein said modelling data comprises at least one data set for each of a plurality
of modelling samples;

wherein said modelling samples define a class group consisting of a plurality of
classes;

wherein each of said modelling samples is of a known class selected from said
class group; and,

(b) using said model with a data set for a test sample from said subject to classify
said test sample as being a member of one class selected from said class group, and
thereby diagnose said subject.

One aspect of the present invention pertains to a method of diagnosis, said method
comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

to diagnose a subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

wherein said modelling data comprises a plurality of data sets for modelling
samples of known class;

to classify a test sample from said subject as being a member of one of said
known classes, and thereby diagnose said subject.

One aspect of the present invention pertains to a method of diagnosing a predetermined
condition of a subject, said method comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to modelling data;

wherein said modelling data comprises at least one data set for each of a plurality
of modelling samples;

wherein said modelling samples define a class group consisting of a plurality of
classes;

wherein each of said modelling samples is of a known class selected from said
class group;

with a data set for a test sample from said subject to classify said test sample as
being a member of one class selected from said class group, and thereby diagnose said
subject.

Certain Preferred Embodiments

In one embodiment, said sample is a sample from a subject, and said predetermined
condition is a predetermined condition of said subject.

In one embodiment, said test sample is a test sample from a subject, and said
predetermined condition is a predetermined condition of said subject.

In one embodiment, said one or more predetermined diagnostic spectral windows are
associated with one or more diagnostic species.

In one embodiment, said relating step involves the use of a predictive mathematical
model; for example, as described herein.

The nature of a predictive mathematical model is determined primarily by the modelling
method employed when forming that model.

In one embodiment, said modelling method is a multivariate statistical analysis modelling 
method. 

In one embodiment, said modelling method is a multivariate statistical analysis modelling 
method which employs a pattern recognition method. 

In one embodiment, said modelling method is, or employs, PCA.

In one embodiment, said modelling method is, or employs, PLS.

In one embodiment, said modelling method is, or employs, PLS-DA.

In one embodiment, said modelling method includes a step of data filtering. 

In one embodiment, said modelling method includes a step of orthogonal data filtering. 

In one embodiment, said modelling method includes a step of OSC. 

In one embodiment, said model takes account of one or more diagnostic species. 
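
For the embodiment in which the modelling method is, or employs, PCA, the following minimal sketch shows the core operation: mean-centring the data sets and extracting the first principal component and its scores. It uses power iteration on the covariance matrix in plain Python for illustration only; real spectral data sets would normally be analysed with dedicated multivariate software, and the function names and data values here are hypothetical.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]  # rows = samples, columns = variables (e.g., buckets)


def mean_centre(rows: Matrix) -> Matrix:
    """Subtract the column mean from every variable."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_loading(rows: Matrix, iterations: int = 200) -> List[float]:
    """Unit-length loading vector of the first principal component."""
    centred = mean_centre(rows)
    n, p = len(centred), len(centred[0])
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / sqrt(p)] * p  # start vector; fails if orthogonal to the component
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the data variance")
        v = [x / norm for x in w]
    return v


def scores(rows: Matrix, loading: List[float]) -> List[float]:
    """Project each mean-centred sample onto the loading vector."""
    return [sum(x * l for x, l in zip(row, loading)) for row in mean_centre(rows)]


spectra = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]  # hypothetical two-bucket data sets
loading = first_loading(spectra)
t1 = scores(spectra, loading)
print(loading, t1)
```

In this toy data all variation lies along a single direction, so the first component captures it entirely; the scores are the coordinates that a pattern recognition method would then inspect for clustering by class.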

The precise details of the predictive mathematical model are determined primarily by the
modelling data (e.g., modelling data sets).

In one embodiment, said modelling data comprise spectral data. 

In one embodiment, said modelling data comprise both spectral data and non-spectral
data (and are referred to as "composite data").

In one embodiment, said modelling data comprise NMR spectral data.

In one embodiment, said modelling data comprise both NMR spectral data and non-NMR
spectral data.

In one embodiment, said NMR spectral data comprises ¹H NMR spectral data and/or ¹³C
NMR spectral data.

In one embodiment, said NMR spectral data comprises ¹H NMR spectral data.

In one embodiment, said modelling data comprise spectra.

In one embodiment, said modelling data are spectra.

In one embodiment, said modelling data comprises a plurality of data sets for modelling
samples of known class.

In one embodiment, said modelling data comprises at least one data set for each of a 
plurality of modelling samples. 

In one embodiment, said modelling data comprises exactly one data set for each of a 
plurality of modelling samples. 

In one embodiment, said using step is: using said model with a data set for said test
sample to classify said test sample as being a member of one class selected from said
class group.

In one embodiment, each of said data sets comprises spectral data. 

In one embodiment, each of said data sets comprises both spectral data and
non-spectral data (and is referred to as a "composite data set").
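
One way to picture a composite data set is as a single record holding the spectral variables alongside the non-spectral ones, flattened into one vector before modelling. The class name, field names, and clinical variables below are assumptions made for illustration and do not appear in this specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CompositeDataSet:
    """Spectral data plus non-spectral (e.g., clinical) data for one sample."""
    spectral: List[float]                                  # e.g., bucket intensities
    non_spectral: Dict[str, float] = field(default_factory=dict)

    def as_vector(self, variable_order: List[str]) -> List[float]:
        """Concatenate spectral values with the named non-spectral values,
        in a fixed order so that every sample yields comparable columns."""
        return list(self.spectral) + [self.non_spectral[name] for name in variable_order]


# Hypothetical sample: three bucket intensities and two clinical measurements.
data_set = CompositeDataSet(
    spectral=[0.12, 0.30, 0.08],
    non_spectral={"age": 61.0, "systolic_bp": 142.0},
)
vector = data_set.as_vector(["age", "systolic_bp"])
print(vector)
```

Fixing the order of the non-spectral variables matters because a modelling method treats each column position as the same variable across all samples.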

In one embodiment, each of said data sets comprises NMR spectral data. 



In one embodiment, each of said data sets comprises both NMR spectral data and non-NMR
spectral data.

In one embodiment, said NMR spectral data comprises ¹H NMR spectral data and/or ¹³C
NMR spectral data.

In one embodiment, said NMR spectral data comprises ¹H NMR spectral data.

In one embodiment, each of said data sets comprises a spectrum.

In one embodiment, each of said data sets comprises a ¹H NMR spectrum and/or a
¹³C NMR spectrum.

In one embodiment, each of said data sets comprises a ¹H NMR spectrum.

In one embodiment, each of said data sets is a spectrum.

In one embodiment, each of said data sets is a ¹H NMR spectrum and/or a ¹³C NMR
spectrum.

In one embodiment, each of said data sets is a ¹H NMR spectrum.

In one embodiment, said non-spectral data is non-spectral clinical data.

In one embodiment, said non-NMR spectral data is non-spectral clinical data.

In one embodiment, said class group comprises classes associated with said
predetermined condition (e.g., presence, absence, degree, etc.).

In one embodiment, said class group comprises exactly two classes.

In one embodiment, said class group comprises exactly two classes: presence of said
predetermined condition; and absence of said predetermined condition.

Classification, Classifying, and Classes

As discussed above, many aspects of the present invention pertain to methods of
classifying things, for example, a sample, a subject, etc. In such methods, the thing is
classified, that is, it is associated with an outcome, or, more specifically, it is assigned
membership to a particular class (i.e., it is assigned class membership), and is said "to
be of," "to belong to," "to be a member of," a particular class.

Classification is made (i.e., class membership is assigned) on the basis of diagnostic
criteria. The step of considering such diagnostic criteria, and assigning class
membership, is described by the word "relating," for example, in the phrase "relating
NMR spectral intensity at one or more predetermined diagnostic spectral windows for
said sample (i.e., diagnostic criteria) with the presence or absence of a predetermined
condition (i.e., class membership)."

For example, "presence of a predetermined condition" is one class, and "absence of a
predetermined condition" is another class; in such cases, classification (i.e., assignment
to one of these classes) is equivalent to diagnosis.

Samples

As discussed above, many aspects of the present invention pertain to methods which
involve a sample, e.g., a particular sample under study ("study sample").

In general, a sample may be in any suitable form. For methods which involve spectra
obtained or recorded for a sample, the sample may be in any form which is compatible
with the particular type of spectroscopy, and therefore may be, as appropriate,
homogeneous or heterogeneous, comprising one or a combination of, for example, a
gas, a liquid, a liquid crystal, a gel, and a solid.

Samples which originate from an organism (e.g., subject, patient) may be in vivo; that is,
not removed from or separated from the organism. Thus, in one embodiment, said
sample is an in vivo sample. For example, the sample may be circulating blood, which is
"probed" in situ, in vivo, for example, using NMR methods.

Samples which originate from an organism may be ex vivo; that is, removed from or
separated from the organism (e.g., an ex vivo blood sample, an ex vivo urine sample).
Thus, in one embodiment, said sample is an ex vivo sample.

In one embodiment, said sample is an ex vivo blood or blood-derived sample.

In one embodiment, said sample is an ex vivo blood sample.

In one embodiment, said sample is an ex vivo plasma sample.

In one embodiment, said sample is an ex vivo serum sample.

In one embodiment, said sample is an ex vivo urine sample.

In one embodiment, said sample is removed from or separated from an/said organism,
and is not returned to said organism (e.g., an ex vivo blood sample, an ex vivo urine
sample).

In one embodiment, said sample is removed from or separated from an/said organism,
and is returned to said organism (i.e., "in transit") (e.g., as with dialysis methods). Thus,
in one embodiment, said sample is an ex vivo in transit sample.

Examples of samples include:

a whole organism (living or dead, e.g., a living human);

a part or parts of an organism (e.g., a tissue sample, an organ);

a pathological tissue such as a tumour;

a tissue homogenate (e.g., a liver microsome fraction);

an extract prepared from an organism or a part of an organism (e.g., a tissue
sample extract, such as perchloric acid extract);

an infusion prepared from an organism or a part of an organism (e.g., tea, Chinese
traditional herbal medicines);

an in vitro tissue such as a spheroid;

a suspension of a particular cell type (e.g., hepatocytes);

an excretion, secretion, or emission from an organism (especially a fluid);

material which is administered and collected (e.g., dialysis fluid);

material which develops as a function of pathology (e.g., a cyst, blisters); and,

supernatant from a cell culture.

Examples of fluid samples include, for example, blood plasma, blood serum, whole
blood, urine, (gall bladder) bile, cerebrospinal fluid, milk, saliva, mucus, sweat, gastric
juice, pancreatic juice, seminal fluid, prostatic fluid, seminal vesicle fluid, seminal plasma,
amniotic fluid, foetal fluid, follicular fluid, synovial fluid, aqueous humour, ascitic fluid,
cystic fluid, blister fluid, and cell suspensions; and extracts thereof.

Examples of tissue samples include liver, kidney, prostate, brain, gut, blood, blood cells,
skeletal muscle, heart muscle, lymphoid, bone, cartilage, and reproductive tissues.

Still other examples of samples include air (e.g., exhaust), water (e.g., seawater,
groundwater, wastewater, e.g., from factories), liquids from the food industry (e.g., juices,
wines, beers, other alcoholic drinks, tea, milk), solid-like food samples (e.g., chocolate,
pastes, fruit peel, fruit and vegetable flesh such as banana, leaves, meats, whether
cooked or raw, etc.).

A few preferred samples are discussed below.

Blood, Plasma, Serum

Blood is the fluid that circulates in the blood vessels of the body, that is, the fluid that is
circulated through the heart, arteries, veins, and capillaries. The function of the blood
and the circulation is to service the needs of other tissues: to transport oxygen and
nutrients to the tissues, to transport carbon dioxide and various metabolic waste
products away, to conduct hormones from one part of the body to another, and in
general to maintain an appropriate environment in all tissue fluids for optimal survival
and function of the cells.

Blood consists of a liquid component, plasma, and a solid component, cells and formed
elements (e.g., erythrocytes, leukocytes, and platelets), suspended within it.
Erythrocytes, or red blood cells, account for about 99.9% of the cells suspended in
human blood. They contain hemoglobin which is involved in the transport of oxygen and
carbon dioxide. Leukocytes, or white blood cells, account for about 0.1% of the cells
suspended in human blood. They play a role in the body's defense mechanism and
repair mechanism, and may be classified as agranular or granular. Agranular leukocytes
include monocytes and small, medium and large lymphocytes, with small lymphocytes
accounting for about 20-25% of the leukocytes in human blood. T cells and B cells are
important examples of lymphocytes. Three classes of granular leukocytes are known:
neutrophils, eosinophils, and basophils, with neutrophils accounting for about 60% of the
leukocytes in human blood. Platelets (i.e., thrombocytes) are not cells but small
spindle-shaped or rodlike bodies about 3 microns in length which occur in large numbers in
circulating blood. Platelets play a major role in clot formation.

Plasma is the liquid component of blood. It serves as the primary medium for the
transport of materials among cellular, tissue, and organ systems and their various
external environments, and it is essential for the maintenance of normal hemostasis.
One of the most important functions of many of the major tissue and organ systems is to
maintain specific components of plasma within acceptable physiological limits.

Plasma is the residual fluid of blood which remains after removal of suspended cells and
formed elements. Whole blood is typically processed to remove suspended cells and
formed elements (e.g., by centrifugation) to yield blood plasma. Serum is the fluid which
is obtained after blood has been allowed to clot and the clot removed. Blood serum may
be obtained by forming a blood clot (e.g., optionally initiated by the addition of thrombin
and calcium ion) and subsequently removing the clot (e.g., by centrifugation). Serum
and plasma differ primarily in their content of fibrinogen and several components which
are removed in the clotting process. Plasma may be effectively prevented from clotting
by the addition of an anti-coagulant (e.g., sodium citrate, heparin, lithium heparin) to
permit handling or storage. Plasma is composed primarily of water (approximately 90%),
with approximately 7% proteins, 0.9% inorganic salts, and smaller amounts of
carbohydrates, lipids, and organic salts.

The term "blood sample," as used herein, pertains to a sample of whole blood.

The term "blood-derived sample," as used herein, pertains to an ex vivo sample derived
from the blood of the subject under study.

Examples of blood and blood-derived samples include, but are not limited to, whole 
blood (WB), blood plasma (including, e.g., fresh frozen plasma (FFP)), blood serum, 
blood fractions, plasma fractions, serum fractions, blood fractions comprising red blood 
cells (RBC), platelets (PLT), leukocytes, etc., and cell lysates including fractions thereof 
(for example, cells, such as red blood cells, white blood cells, etc., may be harvested and
lysed to obtain a cell lysate).

Methods for obtaining, preparing, handling, and storing blood and blood-derived samples 
(e.g., plasma, serum) are well known in the art. Typically, blood is collected from
subjects using conventional techniques (e.g., from the antecubital fossa), typically
pre-prandially.

For use in the methods described herein, the method used to prepare the blood fraction 
(e.g., serum) should be reproduced as carefully as possible from one subject to the next. 
It is important that the same or similar procedure be used for all subjects. It may be
preferable to prepare serum (as opposed to plasma or other blood fractions) for two
reasons: (a) the preparation of serum is more reproducible from individual to individual
than the preparation of plasma, and (b) the preparation of plasma requires the addition of
anticoagulants (e.g., EDTA, citrate, or heparin) which will be visible in the NMR
metabonomic profile and may reduce the data density available.

A typical method for the preparation of serum suitable for analysis by the methods
described herein is as follows: 10 mL of blood is drawn from the antecubital fossa of an
individual who had fasted overnight, using an 18 gauge butterfly needle. The blood is
immediately dispensed into a polypropylene tube and allowed to clot at room
temperature for 3 hours. The clotted blood is then subjected to centrifugation (e.g.,
4,500 x g for 5 minutes) and the serum supernatant removed to a clean tube. If
necessary, the centrifugation step can be repeated to ensure the serum is efficiently
separated from the clot. The serum supernatant may be analysed "fresh" or it may be
stored frozen for later analysis.

A typical method for the preparation of plasma suitable for analysis by the methods
described herein is as follows: High quality platelet-poor plasma is made by drawing the
blood using a 19 gauge butterfly needle without the use of a tourniquet from the
antecubital fossa. The first 2 mL of blood drawn is discarded and the remainder is
rapidly mixed and aliquoted into Diatube H anticoagulant tubes (Becton Dickinson). After
gentle mixing by inversion the anticoagulated blood is cooled on ice for 15 minutes then
subjected to centrifugation to pellet the cells and platelets (approximately 1,200 x g for
15 minutes). The platelet-poor plasma supernatant is carefully removed, drawing off
the middle third of the supernatant and discarding the upper third (which may contain
floating platelets) and the lower third which is too close to the readily disturbed platelet
layer on the top of the cell pellet. The plasma may then be aliquoted and stored frozen
at -20°C or colder, and then thawed when required for assay.



Samples may be analysed immediately ("fresh"), or may be frozen and stored (e.g., at
-80°C) ("fresh frozen") for future analysis. If frozen, samples are completely thawed prior
to NMR analysis.

In one embodiment, said sample is a blood sample or a blood-derived sample.
In one embodiment, said sample is a blood sample.
In one embodiment, said sample is a blood plasma sample.
In one embodiment, said sample is a blood serum sample.

Urine 


The composition of urine is complex and highly variable both between species and within
species according to lifestyle. A wide range of organic acids and bases, simple sugars
and polysaccharides, heterocycles, polyols, low molecular weight proteins and
polypeptides are present together with inorganic species such as Na⁺, K⁺, Ca²⁺, Mg²⁺,
HCO₃⁻, SO₄²⁻ and phosphates.

The term "urine," as used herein, pertains to whole (or intact) urine, whether in vivo (e.g.,
foetal urine) or ex vivo, e.g., by excretion or catheterisation.

The term "urine-derived sample," as used herein, pertains to an ex vivo sample derived
from the urine of the subject under study (e.g., obtained by dilution, concentration,
addition of additives, solvent- or solid-phase extraction, etc.). Analysis may be
performed using, for example, fresh urine; urine which has been frozen and then thawed;
urine which has been dried (e.g., freeze-dried) and then reconstituted, e.g., with water or
D2O.

Methods for the collection, handling, storage, and pre-analysis preparation of many 
classes of sample, especially biological samples (e.g., biofluids) are well known in the 
art. See, for example, Lindon et al., 1999. 


In one embodiment, said sample is a urine sample or a urine-derived sample.
In one embodiment, said sample is a urine sample. 

Organisms, Subjects, Patients


As discussed above, in many cases, samples are, or originate from, or are drawn or 
derived from, an organism (e.g., subject, patient). In such cases, the organism may be 
as defined below. 

In one embodiment, the organism is a prokaryote (e.g., bacteria) or a eukaryote (e.g.,
protoctista, fungi, plants, animals).

In one embodiment, the organism is a protoctista, an alga, or a protozoan. 

In one embodiment, the organism is a plant, an angiosperm, a dicotyledon, a
monocotyledon, a gymnosperm, a conifer, a ginkgo, a cycad, a fern, a horsetail, a 
clubmoss, a liverwort, or a moss. 

In one embodiment, the organism is an animal. 

In one embodiment, the organism is a chordate, an invertebrate, an echinoderm (e.g.,
starfish, sea urchins, brittlestars), an arthropod, an annelid (segmented worms)
(e.g., earthworms, lugworms, leeches), a mollusk (cephalopoda (e.g., squids, octopi),
pelecypods (e.g., oysters, mussels, clams), gastropods (e.g., snails, slugs)), a nematode
(round worms), a platyhelminthes (flatworms) (e.g., planarians, flukes, tapeworms), a
cnidaria (e.g., jelly fish, sea anemones, corals), or a porifera (e.g., sponges).

In one embodiment, the organism is an arthropod, an insect (e.g., beetles, butterflies,
moths), a chilopoda (centipedes), a diplopoda (millipedes), a crustacean (e.g., shrimps,
crabs, lobsters), or an arachnid (e.g., spiders, scorpions, mites).

In one embodiment, the organism is a chordate, a vertebrate, a mammal, a bird, a reptile 
(e.g., snakes, lizards, crocodiles), an amphibian (e.g., frogs, toads), a bony fish (e.g.,
salmon, plaice, eel, lungfish), a cartilaginous fish (e.g., sharks, rays), or a jawless fish
(e.g., lampreys, hagfish). 

In one embodiment, the organism (e.g., subject, patient) is a mammal.

In one embodiment, the organism (e.g., subject, patient) is a placental mammal,
a marsupial (e.g., kangaroo, wombat), a monotreme (e.g., duckbilled platypus), a rodent
(e.g., a guinea pig, a hamster, a rat, a mouse), murine (e.g., a mouse), a lagomorph
(e.g., a rabbit), avian (e.g., a bird), canine (e.g., a dog), feline (e.g., a cat), equine (e.g., a
horse), porcine (e.g., a pig), ovine (e.g., a sheep), bovine (e.g., a cow), a primate, simian
(e.g., a monkey or ape), a monkey (e.g., marmoset, baboon), an ape (e.g., gorilla,
chimpanzee, orangutang, gibbon), or a human.

Furthermore, the organism may be any of its forms of development, for example, a
spore, a seed, an egg, a larva, a pupa, or a foetus.

In one embodiment, the organism (e.g., subject, patient) is a human. 

The subject (e.g., a human) may be characterised by one or more criteria, for example,
sex, age (e.g., 40 years or more, 50 years or more, 60 years or more, etc.), ethnicity,
medical history, lifestyle (e.g., smoker, non-smoker), hormonal status (e.g.,
pre-menopausal, post-menopausal), etc.

The term "population," as used herein, refers to a group of organisms (e.g., subjects,
patients). If desired, a population (e.g., of humans) may be selected according to one or
more of the criteria listed above.

Conditions 

As discussed above, many methods of the present invention involve assigning class
membership, for example, to one of one or more classes, for example, to one of the two
classes: (i) presence of a predetermined condition, or (ii) absence of a predetermined
condition.

A condition is "predetermined" in the sense that it is the condition in respect to which the
invention is practised; a condition is predetermined by a step of selecting a condition for
consideration, study, etc.

As used herein, the term "condition" relates to a state which is, in at least one respect,
distinct from the state of normality, as determined by a suitable control population.

A condition may be pathological (e.g., a disease) or physiological (e.g., phenotype, 
genotype, fasting, water load, exercise, hormonal cycles, e.g., oestrus, etc.).

Included among conditions is the state of "at risk of" a condition, "predisposition towards"
a condition, and the like, again as compared to the state of normality, as determined by
a suitable control population. In this way, osteoporosis, at risk of osteoporosis, and
predisposition towards osteoporosis are all conditions (and are also conditions
associated with osteoporosis).

Where the condition is the state of "at risk of," "predisposition towards," and the like, a
method of diagnosis may be considered to be a method of prognosis.

In this context, the phrases "at risk of," "predisposition towards," and the like, indicate a
probability of being classified/diagnosed (or being able to be classified/diagnosed) with
the predetermined condition which is greater (e.g., 1.5x, 2x, 5x, 10x, etc.) than for the
corresponding control. Often, a time period (e.g., within the next 5 years, 10 years, 20
years, etc.) is associated with the probability. For example, a subject who is 2x more
likely to be diagnosed with the predetermined condition within the next 5 years, as
compared to a suitable control, is "at risk of" that condition.
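
The comparison described above, a probability of diagnosis expressed as a multiple of the control probability, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `relative_risk` and `is_at_risk`, the default threshold of 1.0, and the example probabilities are assumptions introduced here and do not appear in the specification.

```python
def relative_risk(p_subject: float, p_control: float) -> float:
    """Ratio of the subject's probability of diagnosis to the control's.

    Both probabilities are assumed to refer to the same time period
    (e.g., within the next 5 years).
    """
    if not (0.0 <= p_subject <= 1.0):
        raise ValueError("p_subject must lie in [0, 1]")
    if not (0.0 < p_control <= 1.0):
        raise ValueError("p_control must lie in (0, 1]")
    return p_subject / p_control


def is_at_risk(p_subject: float, p_control: float, threshold: float = 1.0) -> bool:
    """True when the subject's probability exceeds threshold x the control's.

    The threshold is a hypothetical parameter; 1.5, 2, 5 or 10 correspond
    to the multiples given as examples in the text.
    """
    return relative_risk(p_subject, p_control) > threshold


# Hypothetical figures: 10% versus 5% probability of diagnosis within 5 years.
ratio = relative_risk(0.10, 0.05)          # 2x the control probability
flagged = is_at_risk(0.10, 0.05, 1.5)      # exceeds a 1.5x threshold
```

The threshold is left as a parameter because the text gives several possible multiples rather than a single cut-off.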


Included among conditions is the degree of a condition, for example, the progress or
phase of a disease, or a recovery therefrom. For example, each of the different states in the
progress of a disease, or in the recovery from a disease, is itself a condition. In
this way, the degree of a condition may refer to how temporally advanced the condition
is. Another example of a degree of a condition relates to its maximum severity (e.g., a
disease can be classified as mild, moderate or severe). Yet another example of a
degree of a condition relates to the nature of the condition (e.g., anatomical site, extent
of tissue involvement, etc.).

Atherosclerosis/Coronary heart disease

In the present invention, said predetermined condition is associated with
atherosclerosis/coronary heart disease.






Coronary heart disease (CHD) is a major cause of mortality and morbidity in developed
countries, affecting as many as 1 in 3 individuals before the age of 70 years (see, e.g.,
Kannel et al., 1974).

Atherosclerosis (commonly called "hardening of the arteries") is a vascular condition in
which arteries narrow. It is associated with deposits of oxidised lipid on the walls of
arteries, which accumulate and eventually harden into plaques. The arteries become
calcified and lose elasticity, and as this process continues, blood flow slows. It can affect
any artery, including, e.g., the coronary arteries.

In order to perform the arduous task of pumping blood, the heart muscle needs a
plentiful supply of oxygen-rich blood, which is provided through a network of coronary
arteries. Coronary artery disease is the end result of atherosclerosis, preventing
sufficient oxygen-rich blood from reaching the heart. Oxygen deprivation in vital cells
(called ischaemia) causes injury to the tissues of the heart. If the artery becomes
completely blocked, damage becomes so extensive that cell death, a heart attack,
occurs. A heart attack usually occurs when a blood clot forms, completely sealing off the
passage of blood in a coronary artery. This typically happens when the plaque itself
develops fissures or tears; blood platelets adhere to the site to seal off the plaque and a
blood clot (thrombus) forms.



Angina is not a disease itself but is the primary symptom of coronary artery disease. It is
typically experienced as chest pain, which can be mild, moderate, or severe, but is often
reported as a dull, heavy pressure that may resemble a crushing object on the chest.
Pain often radiates to the neck, jaw, or left shoulder and arm. Less commonly, patients
report mild burning chest discomfort, sharp chest pain, or pain that radiates to the right
arm or back. Sometimes a patient experiences shortness of breath, fatigue, or
palpitations instead of pain. Classic angina is precipitated by exertion, stress, or
exposure to cold and is relieved by rest or administration of nitroglycerin. Angina can
also be precipitated by large meals, which place an immediate demand upon the heart
for more oxygen. The intensity of the pain does not always relate to the severity of the
medical problem. Some people may feel a crushing pain from mild ischemia, while
others might experience only mild discomfort from severe ischemia. Some people have
also reported a higher sensitivity to heat on the skin with the onset of angina.



Although atherosclerosis is far and away the leading cause of angina, other conditions
can impair the delivery of oxygen to the heart muscle and cause pain. Such conditions
include: spasm in the coronary artery, abnormalities of the heart muscle itself,
hyperthyroidism, anaemia, vasculitis (a group of disorders that cause inflammation of the
blood vessels), and, in rare cases, exposure to high altitudes. Many conditions may
cause chest pains unrelated to heart or blood vessel abnormalities. High on the list are
anxiety attacks, gastrointestinal disorders (gallstone attacks, peptic ulcer disease, hiatal
hernia, heartburn), lung disorders (asthma, blood clots, bronchitis, pneumonia, collapsed
lung), and problems affecting the ribs and chest muscles (injured muscles, fractures,
arthritis, spasms, infections).

Stable angina can be extremely painful, but its occurrence is predictable; it is usually
triggered by exertion or stress and relieved by rest. Stable angina responds well to
medical treatment. Any event that increases oxygen demand can cause angina,
including exercise, cold weather, emotional tension, and even large meals. Angina
attacks can occur at any time during the day, but a high proportion seems to take place
between the hours of 6:00 AM and noon.

Unstable angina is a much more serious situation and is often an intermediate stage
between stable angina and a heart attack. A patient is usually diagnosed with unstable
angina under the following conditions: pain awakens a patient or occurs during rest, a
patient who has never experienced angina has severe or moderate pain during mild
exertion (walking two level blocks or climbing one flight of stairs), or stable angina has
progressed in severity and frequency within a two-month period. Medications are less
effective in relieving pain of unstable angina.

Another type of angina, called variant or Prinzmetal's angina, is caused by a spasm of a
coronary artery. It almost always occurs when the patient is at rest. Irregular heartbeats
are common, but the pain is generally relieved immediately with treatment.

Some people with severe coronary artery disease do not experience angina pain, a
condition known as silent ischaemia, which some experts attribute to abnormal
processing of heart pain by the brain.

Coronary artery disease (premature blockage of one or more of the coronary arteries) is
the leading killer in the USA of both men and women, responsible for over 475,000
deaths in 1996. On the positive side, mortality rates from coronary artery disease have
significantly declined in industrialised countries over the past few decades, although they
are on the rise in developing nations. When the necessary lifestyle changes are enacted
in combination with appropriate medical or surgical treatments, a person suffering angina
and heart disease has a good chance of living a normal life. Experts have believed, for
example, that unstable angina indicates a very high risk for death after a heart attack, but
a recent study indicated that after the first year of treatment such a patient's risk for
death is only 1.2% above the risk in the normal population. Much evidence exists, in
fact, that onset of angina less than 48 hours before a heart attack is actually protective,
possibly by conditioning the heart to resist the damage resulting from the attack. In one
study, people without chest pain experienced much higher complication and mortality
rates than those with pain.


Angiographic x-ray imaging ("angiography") has grown into its own classification of x-ray
imaging over time. The basic principle is the same as a conventional x-ray scan: x-rays
are generated by an x-ray tube and as they pass through the body part being imaged,
they are attenuated (weakened) at different levels. These differences in x-ray
attenuation are then measured by an image intensifier and the resulting image is picked
up by a TV camera. In modern angiography systems, each frame of the analogue TV
signal is then converted to a digital frame and stored by a computer in memory and/or on
hard magnetic disk. These x-ray "movies" can be viewed in real time as the angiography
is being performed, or they can be reviewed later using recall from digital memory.


During angiography, physicians inject streams of contrast agents or dyes into the area of
interest using catheters to create detailed images of the blood vessels in real time.
During the angiographic procedure, physicians can guide a catheter into the area of
interest to remove stenoses (blockages) of blood vessels. Patients with blockages of the
major leg vessels, for instance, can have nearly total recovery after such angioplasty is
performed to remove the constriction.

X-ray angiography is performed to specifically image and diagnose diseases of the blood
vessels of the body, including the brain and heart. Traditionally, angiography was used
to diagnose pathology of these vessels such as blockage caused by plaque build-up.
However, in recent decades, radiologists, cardiologists and vascular surgeons have used
the x-ray angiography procedure to guide minimally invasive surgery of the blood vessels
and arteries of the heart. In the last several years, diagnostic vascular images have often
been made using magnetic resonance imaging, computed x-ray tomography or ultrasound,
whilst x-ray angiography is reserved for therapy. Conventional x-ray angiography has a
lead role in the detection, diagnosis and treatment of heart disease, heart attack, acute
stroke and vascular disease which can lead to stroke.

Most conventional x-ray angiography procedures are similar. Patient preparation
involves removing clothing and jewellery and wearing a patient gown. In all cases,
angiography requires that an intravenous contrast agent is administered. For
interventional or therapeutic angiography, a small incision is made in the groin or arm so
that a catheter can be inserted during the study. The patient is positioned on the
examination table by the technologist so that the anatomy of interest (e.g., coronary
arteries) is in the proper field of view between the x-ray tube and image intensifier. The
technologist and radiologist remain at table-side during the procedure to operate the
angiography system and work with the catheters, contrast injectors and related devices.
Typically the patient simply needs to relax and stay calm during angiography. Some
angiography procedures can take up to two hours while other procedures take less than
an hour. Once the procedure is finished, the patient will be given a period of time to
recover. During this period, the patient's case is reviewed on film or monitor. Depending
on the type of angiographic procedure and the patient's medical condition, an inpatient
recovery may be required or the patient may be released after a short time. In some
cases, more images may need to be taken.

Using angiography to see inside the body, doctors can repair blood vessels without the
use of a scalpel and fully invasive surgical methods. Advances in the design and use of
catheters (small tubes that are guided into the blood vessels through tiny incisions in the
groin area or upper arm) allow physicians to perform very complex therapeutic
procedures from within the blood vessel. Pathology of the blood vessels such as plaque
build up in the arms and legs, neck and brain, and heart can be treated using a variety of
interventional angiographic surgery (e.g., coronary angioplasty).

Although coronary angiography is the gold standard for CHD (including detection,
diagnosis, and treatment), this technique is not without its problems. Coronary
angiography is an extremely invasive technique and is associated with a morbidity rate of
1% and a mortality rate of 0.1%. In addition to the invasive nature of angiography, the
technique is also very expensive and time-consuming. In the UK, the average cost for
coronary angiography is approximately £8,000 - £10,000 per case. The disadvantages
associated with coronary angiography make the technique unsuitable as a routine
screening procedure.



Over the past three decades a range of environmental and biochemical risk factors for
the development of CHD have been identified in cross-sectional studies (see, e.g.,
Kjelsberg et al., 1997). Examples are listed in Table 1-CHD. For example, tobacco
smoking is associated with an approximately 2-fold increased risk of CHD (see, e.g.,
Kuller et al., 1991). Similarly, high levels of cholesterol in large, triglyceride-rich
lipoprotein particles (mainly VLDL and LDL) and lower levels of cholesterol in HDL
particles are well known to be associated with increased risk of CHD (see, e.g., MRFIT
Research Group, 1986; Despres et al., 2000).



Table 1-CHD
Risk Factors for Coronary Heart Disease

Fixed Risk Factors        Potentially Changeable Risk Factors
                          Strong Association     Weak Association
------------------------  ---------------------  ---------------------
age                       hyperlipidaemia        personality
male sex                  cigarette smoking      obesity
positive family history   hypertension           gout
                          diabetes mellitus      soft water
                                                 lack of exercise
                                                 contraceptive pill
                                                 heavy alcohol intake

These epidemiological studies have been tremendously useful in a number of ways.
Firstly, they have underpinned public health policy on a range of issues, discouraging
tobacco smoking and promoting low cholesterol diets (see, e.g., McIlvain et al., 1992;
Dolecek et al., 1986). Secondly, they have provided vital clues as to the underlying
molecular mechanisms which cause atherosclerosis and CHD (see, e.g., Ross, 1999).
For example, once the association between elevated levels of LDL-cholesterol and CHD
had been identified, it was possible to demonstrate that increased LDL-cholesterol
actually causes atherosclerosis by reverse genetic techniques in mice (see, e.g., Plump
et al., 1992; Yokode et al., 1990; Breslow, 1993). Extending these studies, therapies
were then designed on the basis of their ability to lower LDL-cholesterol. These lipid
lowering therapies have now been shown to be broadly effective in reducing the risk of
myocardial infarction, even among people with normal levels of LDL-cholesterol.

However, the risk factors identified to date from cross-sectional epidemiological studies
are insufficiently powerful to provide a clinically useful diagnosis of CHD. Although
algorithms have been designed based on a range of risk factors, such as age, sex,
lipoprotein levels and blood pressure, which can identify sub-populations at very
significant excess risk of CHD, even the best of these, based on the excellent PROCAM
study in Münster, Germany, cannot diagnose the presence of CHD on an individual by
individual basis (see, e.g., Cullen et al., 1998). It is likely that CHD is weakly associated
with a very large number of environmental, physiological and biochemical variables, and
as a result even the full range of risk factors discovered to date comprises insufficient
density of data to accurately discriminate CHD patients from healthy controls on an
individual basis (see, e.g., Isles et al., 2000).


Recently, there have been technical advances which have allowed datasets to be
constructed from individuals which have extremely high data densities. Techniques such
as genomics (examining the cellular gene expression pattern of thousands of genes
simultaneously, see, e.g., Collins et al., 2001), proteomics (examining the cellular
contents of multiple proteins simultaneously, see, e.g., Dutt et al., 2000) and
metabonomics (examining the changes in hundreds or thousands of low molecular
weight metabolites in an intact tissue or biofluid) offer the prospect of efficiently
distinguishing individuals with particular disease or toxic states (see, e.g., Nicholson et
al., 1999).


Whereas currently a firm diagnosis of CHD can only be made through application of
angiography, which is both expensive and invasive, the introduction of metabonomic
screening, as described herein, would allow diagnosis to be made simply and cheaply on
the basis of a single blood sample, e.g., a non-invasive diagnosis of CHD. Such
changes would revolutionize the provision of health care for CHD, allowing both
widespread population screening and efficient targeting of drugs such as statins which,
while being broadly effective in reducing the risk of myocardial infarction, are difficult to
target to those most in need of treatment.

Atherosclerotic Load and Atherosclerotic Conditions

In one embodiment, the predetermined condition is related to atherosclerotic load, for
example, a state of abnormally high atherosclerotic load.

The terms "atherosclerotic load" and "atherosclerotic burden," as used herein, pertain to
the total volume of atherosclerotic plaque tissue found throughout the vascular tree of a
subject. Although most direct diagnostic procedures, such as angiography, examine
only a particular site (e.g., the coronary arteries), most biochemical tests which depend
on analysis of the blood are associated with the total atherosclerotic load throughout the
vascular tree. In most cases, however, the presence of atherosclerosis in one organ
system is indicative of its presence in others. Thus, subjects with coronary artery
atherosclerosis will, in general, have higher total atherosclerotic load than subjects
without coronary artery atherosclerosis. The converse is also true: individuals with high
total atherosclerotic loads are much more likely to have coronary artery disease than
individuals with low atherosclerotic loads. Different conditions are associated with the
presence of atherosclerosis in particular arteries, for example, coronary heart disease is
associated with atherosclerosis, at least in part, in the coronary arteries; stroke is
associated with atherosclerosis, at least in part, in the carotid arteries.

In one embodiment, the predetermined condition is related to an atherosclerotic
condition.

The term "atherosclerotic condition," as used herein, pertains to a condition associated 
with an abnormally high atherosclerotic load, as compared to a suitable control
population. 


Examples of atherosclerotic conditions include, but are not limited to, the following, which 
are organised by the artery system affected or most affected or most relevant: 

Peripheral vascular disease (PVD). This can lead to ischemia in the extremities, leading
to pain, morbidity and in severe cases to amputation.

Deep vein thrombosis (DVT). This is a common cause of ischemia, often secondary to 
PVD, but may have other causes (e.g., long periods of inactivity on long-haul flights). 

Diabetes macrovascular atherosclerosis. This is one of the most common complications
of diabetes. It may also include complications at specific vascular beds, most commonly
diabetic retinopathy and diabetic nephropathy, where the vascular beds of the eye and
kidney, respectively, are particularly badly affected.

Coronary artery disease (CAD). This is the most common cause of heart attacks, and is
atherosclerosis of one or more major coronary arteries.

Angina. This describes the specific symptoms of CAD, and can be stable or unstable.

Ischemic stroke. The most common cause of stroke is ischemia secondary to
atherosclerosis of the major arteries supplying the brain. This includes all forms of
stroke except haemorrhagic stroke.

Transient ischemic attack syndrome (TIA). This is the brain equivalent of angina, in
which the blood supply to the brain is reduced - not sufficiently to cause infarction (tissue
death), but sufficiently to lead to symptoms resembling epilepsy.

Renal hypertension. One of the most common causes of hypertension is atherosclerosis
of the renal artery, which reduces kidney perfusion and upsets the blood volume
regulatory mechanisms.

Marfan Syndrome. A relatively common inherited monogenic disorder due to mutation in 
the fibrillin genes, which results in vascular changes which can resemble atherosclerosis. 

MoyaMoya disease. This condition is similar to Marfan syndrome, but affects 
predominantly the brain vasculature. 

Monkeburg Syndrome. A rare monogenic disorder in which vascular calcification, similar 
to that seen in atherosclerosis, affects the aorta. This condition resembles Marfan 
syndrome and can lead to dissection of the vessel and death. 

NMR Spectroscopy

As discussed above, many aspects of the present invention pertain to methods which 
employ NMR spectra, or data obtained or derived from NMR spectra. 

The principal nucleus studied in biomedical NMR spectroscopy is the proton or ¹H
nucleus. This is the most sensitive of all naturally occurring nuclei. The chemical shift
range is about 10 ppm for organic molecules. In addition, ¹³C NMR spectroscopy using
either the naturally abundant 1.1% ¹³C nuclei or employing isotopic enrichment is useful
for identifying metabolites. The chemical shift range is about 200 ppm. Other nuclei
find special application. These include ¹⁵N (in natural abundance or enriched), ¹⁹F for
studies of drug metabolism, and ³¹P for studies of endogenous phosphate biochemistry
either in vitro or in vivo.

In order to obtain an NMR spectrum, it is necessary to define a "pulse program". At its
simplest, this is application of a radio-frequency (RF) pulse followed by acquisition of a
free induction decay (FID) - a time-dependent oscillating, decaying voltage which is
digitised in an analog-digital converter (ADC). At equilibrium, the nuclear spins are
present in a number of quantum states and the RF pulse disturbs this equilibrium. The
FID is the result of the spins returning towards the equilibrium state. It is necessary to
choose the length of the pulse (usually a few microseconds) to give the optimum
response.

This, and other experimental parameters, are chosen on the basis of knowledge and
experience on the part of the spectroscopist. See, for example, T.D.W. Claridge,
High-Resolution NMR Techniques in Organic Chemistry: A Practical Guide to Modern NMR
for Chemists, Oxford University Press, 2000. These are based on the observation
frequency to be used and the known properties of the nucleus under study (i.e., the
expected chemical shift range will determine the spectral width, the desired peak
resolution determines the number of data points, the relaxation times determine the
recycle time between scans, etc.). The number of scans to be added is determined by
the concentration of the analyte, the inherent sensitivity of the nucleus under study and
its abundance (either natural or enhanced by isotopic enrichment).

After data acquisition, a number of manipulations are possible. The FID can be
multiplied by a mathematical function to improve the signal-to-noise ratio or reduce the
peak line widths. The expert operator has choice over such parameters. The FID is
then often filled by a number of zeros and then subjected to Fourier transformation. After
this conversion from time-dependent data to frequency-dependent data, it is necessary
to phase the spectrum so that all peaks appear upright - this is done using two
parameters by visual inspection on screen (automatic routines are now available with
reasonable success). At this point the spectrum baseline may be curved. To remedy
this, one defines points in the spectrum where no peaks appear and these are taken to
be baseline. Usually, a polynomial function is fitted to these points (although other
methods are available), and this function is subtracted from the spectrum to provide a
flat baseline. This can also be done in an automatic fashion. Other manipulations are
also possible. It is possible to extend the FID forwards or backwards by "linear
prediction" to improve resolution or to remove so-called truncation artefacts which occur
if data acquisition of a scan is stopped before the FID has decayed into the noise. All of
these decisions are also applicable to 2- and 3-dimensional NMR spectroscopy.
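
By way of illustration only, the first three manipulations described above (multiplication of the FID by a weighting function, zero filling, and Fourier transformation) may be sketched as follows. The function name `process_fid`, the exponential window, the synthetic single-resonance FID, and all parameter values are assumptions made for this sketch; phasing and baseline correction are not shown.

```python
import numpy as np

def process_fid(fid, dwell_time, line_broadening_hz=1.0, zero_fill_factor=2):
    """Apodise, zero-fill and Fourier transform a complex FID (illustrative)."""
    n = fid.size
    t = np.arange(n) * dwell_time
    # Exponential weighting: improves signal-to-noise at the cost of line width.
    window = np.exp(-np.pi * line_broadening_hz * t)
    # Zero filling: append zeros so the transformed spectrum has more points.
    padded = np.zeros(n * zero_fill_factor, dtype=complex)
    padded[:n] = fid * window
    # Fourier transformation from time-dependent to frequency-dependent data.
    spectrum = np.fft.fftshift(np.fft.fft(padded))
    freqs = np.fft.fftshift(np.fft.fftfreq(padded.size, d=dwell_time))
    return freqs, spectrum

# Synthetic FID (assumed values): one resonance at +100 Hz decaying with T2* = 0.2 s.
dwell = 1e-3
t = np.arange(1024) * dwell
fid = np.exp(2j * np.pi * 100.0 * t) * np.exp(-t / 0.2)

freqs, spectrum = process_fid(fid, dwell)
peak_hz = freqs[np.argmax(spectrum.real)]
```

In this sketch the real part of the transformed data is already in absorption mode because the synthetic FID starts with zero phase; a measured FID would generally still require the phase correction described above.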

An NMR spectrum consists of a series of digital data points with a y value (relating to
signal strength) as a function of equally spaced x-values (frequency). These data point
values run over the whole of the spectrum. Individual peaks in the spectrum are
identified by the spectroscopist or automatically by software and the area under each
peak is determined either by integration (summation of the y values of all points over the
peak) or by curve fitting. A peak can be a single resonance or a multiplet of resonances
corresponding to a single type of nucleus in a particular chemical environment (e.g., the
two protons ortho to the carboxyl group in benzoic acid). Integration is also possible of
the three dimensional peak volumes in 2-dimensional NMR spectra. The intensity of a
peak in an NMR spectrum is proportional to the number of nuclei giving rise to that peak
(if the experiment is conducted under conditions where each successive accumulated
free induction decay (FID) is taken starting at equilibrium). Also, the relative intensity of
peaks from different analytes in the same sample is proportional to the concentration of
that analyte (again if equilibrium prevails at the start of each scan).

Thus, the term "NMR spectral intensity," as used herein, pertains to some measure
related to the NMR peak area, and may be absolute or relative. NMR spectral intensity
may be, for example, a combination of a plurality of NMR spectral intensities, e.g., a
linear combination of a plurality of NMR spectral intensities.

In the context of NMR spectral intensity, the term "NMR" refers to any type of NMR
spectroscopy.



NMR spectroscopic techniques can be classified according to the number of frequency
axes and these include 1D-, 2D-, and 3D-NMR. 1D spectra include, for example, single
pulse; water-peak eliminated either by saturation or non-excitation; spin-echo, such as
CPMG (i.e., edited on the basis of spin-spin relaxation); diffusion-edited; selective
excitation of specific spectral regions. 2D spectra include, for example, J-resolved (JRES);
1H-1H correlation methods, such as NOESY, COSY, TOCSY and variants thereof;
heteronuclear correlation including direct detection methods, such as HETCOR, and
inverse-detected methods, such as 1H-13C HMQC, HSQC, and HMBC. 3D spectra include
many variants, all of which are combinations of 2D methods, e.g., HMQC-TOCSY,
NOESY-TOCSY, etc. All of these NMR spectroscopic techniques can also be combined
with magic-angle spinning (MAS) in order to study samples other than isotropic liquids,
such as tissues, which are characterised by anisotropic composition.

Preferred nuclei include ¹H and ¹³C. Preferred techniques for use in the present
invention include water-peak eliminated, spin-echo such as CPMG, diffusion edited,
JRES, COSY, TOCSY, HMQC, HSQC, and HMBC.

NMR analysis (especially of biofluids) is carried out at as high a field strength as is
practical, according to availability (very high field machines are not widespread), cost (a
600 MHz instrument costs about £500,000 but a shielded 800 MHz instrument can cost
more than £3,500,000, depending on the nature of accessory equipment purchased),
and ability to accommodate the physical size of the instrument. Maintenance/operational
costs do not vary greatly and are small compared to the capital cost of the machine and
the personnel costs.

Typically, the ¹H observation frequency is from about 200 MHz to about 900 MHz, more
typically from about 400 MHz to about 900 MHz, yet more typically from about 500 MHz
to about 750 MHz. ¹H observation frequencies of 500 and 600 MHz may be particularly
preferred. Instruments with the following ¹H observation frequencies are/were
commercially available: 200, 250, 270 (discontinued), 300, 360 (discontinued), 400, 500,
600, 700, 750, 800, and 900 MHz.

Higher frequencies are used to obtain better signal-to-noise ratio and for greater spectral
dispersion of resonances. This gives a better chance of identifying the molecules giving
rise to the peaks. The benefit is not linear because in addition to the better dispersion,
the detailed spectral peaks can move from being "second-order," where analysis by
inspection is not possible, towards "first-order," where it is. Both peak positions and
intensities within multiplets change in a non-linear fashion as this progression occurs.
Lower observation frequencies would be used where cost is an issue, but this is likely to
lead to reduced effectiveness for classification and identification of biomarkers.

NMR Spectroscopy: Sample Preparation

NMR spectra can be measured in solid, liquid, liquid crystal or gas states over a range of
temperatures from 120 K to 420 K and outside this range with specialised equipment.
Typically, NMR analysis of biofluids is performed in the liquid state with a sample
temperature of from about 274 K to about 328 K, but more typically from about 283 K to
about 321 K. An example of a typical temperature is about 300 K.

Lower temperatures would be used to ensure that the biofluid did not suffer from any
decomposition or show any effects of chemical or enzymatic reactions during the data
acquisition. Higher temperatures may be used to improve detection of certain species.
For example, for plasma or serum, lipoproteins undergo a series of phase changes as
the temperature is increased; in particular, the low density lipoprotein (LDL) peak
intensities are rather temperature dependent and the lines sharpen and broader,
more-difficult-to-detect components become visible as the lipoprotein becomes more
"liquid."

Typically, biofluid samples are diluted with solvent prior to NMR analysis. This is done
for a variety of reasons, including: to lessen solution viscosity, to control the pH of the
solution, and to allow addition of reagents and reference materials.

An example of a typical dilution solvent is a solution of 0.9% by weight of sodium chloride
in D2O. The D2O lessens the overall concentration of H2O and eases the technical
requirements in the suppression of the solvent water NMR resonance, necessary for
optimum detection of metabolite NMR signals. The deuterium nuclei of the D2O also
provide an NMR signal for locking the magnetic field, enabling the exact co-registration
of successive scans.

Depending on the available amount of the biofluid, typically, the dilution ratio is from
about 1:50 to about 5:1 by volume, but more typically from about 1:20 to about 1:1 by
volume. An example of a typical dilution ratio is 3:7 by volume (e.g., 150 µL sample,
350 µL solvent), typical for conventional 5 mm NMR tubes and for flow-injection NMR
spectroscopy.

Typical sample volumes for NMR analysis are from about 50 µL (e.g., for microprobes) to
about 2 mL. An example of a typical sample volume is about 500 µL.

NMR peak positions (chemical shifts) are measured relative to that of a known standard
compound usually added directly to the sample. For biofluids such as urine this is
commonly a partially deuterated form of TSP, i.e., 3-trimethylsilyl-[2,2,3,3-²H4]-propionate
sodium salt. For biofluids containing high levels of proteins, this substance is not
suitable since it binds to proteins and shows a broadened NMR line. Added formate
anion (e.g., as a salt) can be used in such cases, as for blood plasma.

NMR Spectroscopy: Manipulation of NMR Spectra

NMR spectra are typically acquired, and subsequently handled, in digitised form.
Conventional methods of spectral pre-processing of (digital) spectra are well known, and
include, where applicable, signal averaging, Fourier transformation (and other
transformation methods), phase correction, baseline correction, smoothing, and the like
(see, for example, Lindon et al., 1980).

Modern spectroscopic methods often permit the collection of high or very high resolution
spectra. In digital form, even a simple spectrum (e.g., signal versus spectroscopic
parameter) may have many thousands, if not tens of thousands of data points. It is often
desirable to reduce or compress the data to give fewer data points, for both practical
computing methods and also to effect some degree of signal averaging to compensate
for physical effects, such as pH variation, compartmentalisation, and the like. The
resulting data may be referred to as "spectral data."

For example, a typical ¹H NMR spectrum is recorded as signal intensity versus chemical
shift (δ) which ranges from about δ 0 to δ 10. At a typical chemical shift resolution of
about 10⁻³ to 10⁻⁴ ppm, the spectrum in digital form comprises about 10,000 to 100,000
data points. As discussed above, it is often desirable to compress this data, for example,
by a factor of about 10 to 100, to about 1000 data points.

For example, in one approach, the chemical shift axis, δ, is "segmented" into "buckets"
or "bins" of a specific length. For a 1-D ¹H NMR spectrum which spans the range from δ
0 to δ 10, using a bucket length, Δδ, of 0.04 yields 250 buckets, for example, δ 10.0-9.96,
δ 9.96-9.92, δ 9.92-9.88, etc., usually reported by their midpoint, for example, δ
9.98, δ 9.94, δ 9.90, etc. The signal intensity within a given bucket may be averaged or
integrated, and the resulting value reported. In this way, a spectrum with, for example,
100,000 original data points can be compressed to an equivalent spectrum with, for
example, 250 data points.
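
The bucketing just described may be sketched as follows, purely as an illustration. The function name `bucket_spectrum` and the synthetic flat spectrum are assumptions; the sketch integrates (sums) the intensities in each bucket and lists buckets in ascending δ, whereas the text above lists them from δ 10 downwards.

```python
import numpy as np

def bucket_spectrum(ppm, intensity, width=0.04, lo=0.0, hi=10.0):
    """Integrate a 1-D spectrum over fixed-width chemical shift buckets."""
    n_buckets = int(round((hi - lo) / width))
    edges = np.linspace(lo, hi, n_buckets + 1)
    # Summing the intensities that fall in each bucket = integration per bucket.
    integrals, _ = np.histogram(ppm, bins=edges, weights=intensity)
    midpoints = (edges[:-1] + edges[1:]) / 2.0
    return midpoints, integrals

# Synthetic spectrum (assumed): 100,000 equally spaced points, unit intensity.
ppm = (np.arange(100_000) + 0.5) * 1e-4
intensity = np.ones_like(ppm)

midpoints, integrals = bucket_spectrum(ppm, intensity)
# 250 buckets, each containing 400 of the original data points.
```

Replacing the sum by a mean per bucket would give the "averaged" variant mentioned above.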

A similar approach can be applied to 2-D spectra, 3-D spectra, and the like. For 2-D
spectra, the "bucket" approach may be extended to a "patch." For 3-D spectra, the
"bucket" approach may be extended to a "volume." For example, a 2-D ¹H NMR
spectrum which spans the range from δ 0 to δ 10 on both axes, using a patch of Δδ 0.1 x
Δδ 0.1 yields 10,000 patches. In this way, a spectrum with perhaps 10⁸ original data
points can be compressed to an equivalent spectrum of 10⁴ data points.

In this context, the equivalent spectrum may be referred to as "a spectral data set," "a
data set comprising spectral data," etc.

Software for such processing of NMR spectra, for example AMIX (Analysis of Mixture, V
2.5, Bruker Analytik, Rheinstetten, Germany), is commercially available.

Often, certain spectral regions carry no real diagnostic information, or carry conflicting
biochemical information, and it is often useful to remove these "redundant" regions
before performing detailed analysis. In the simplest approach, the data points are
deleted. In another simple approach, the data in the redundant regions are replaced with
zero values.

For example, due to the dynamic range problem with water in comparison with other
molecules, the water resonance (around δ 4.7) is suppressed. However, small variations
in water suppression remain, and these variations can undesirably complicate analysis.
Similarly, variations in water suppression may also affect the urea signal (around δ 6.0),
by cross saturation. Therefore, it is often useful to delete certain spectral regions, for
example, from about δ 4.5 to 6.0 (e.g., δ 4.52 to 6.00).
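
As a minimal sketch of the two simple approaches mentioned above (deleting the data points, or replacing them with zero values), applied to bucketed data. The function name `exclude_region` and the flat test data are assumptions made for illustration.

```python
import numpy as np

def exclude_region(midpoints, values, lo=4.52, hi=6.00, mode="delete"):
    """Remove, or zero, buckets whose midpoints lie in a redundant region."""
    in_region = (midpoints >= lo) & (midpoints <= hi)
    if mode == "delete":
        return midpoints[~in_region], values[~in_region]
    zeroed = values.copy()
    zeroed[in_region] = 0.0
    return midpoints, zeroed

# 250 buckets of width 0.04 spanning delta 0 to 10, reported by midpoint.
midpoints = 0.02 + 0.04 * np.arange(250)
values = np.ones(250)

kept_mid, kept_val = exclude_region(midpoints, values)               # delete
same_mid, zeroed_val = exclude_region(midpoints, values, mode="zero")  # zero
```

With these settings 37 buckets (midpoints δ 4.54 to δ 5.98) fall in the excluded region, leaving 213 descriptors in the "delete" variant.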

In general, NMR data is handled as a data matrix. Typically, each row in the matrix
corresponds to an individual sample (often referred to as a "data vector"), and the entries
in the columns are, for example, spectral intensity of a particular data point, at a
particular δ or Δδ (often referred to as "descriptors").

It is often useful to pre-process data, for example, by addressing missing data,
translation, scaling, weighting, etc.

Multivariate projection methods, such as principal component analysis (PCA) and partial
least squares analysis (PLS), are so-called scaling sensitive methods. By using prior
knowledge and experience about the type of data studied, the quality of the data prior to
multivariate modelling can be enhanced by scaling and/or weighting. Adequate scaling
and/or weighting can reveal the important and interesting variation hidden within the
data, and therefore make subsequent multivariate modelling more efficient. Scaling and
weighting may be used to place the data in the correct metric, based on knowledge and
experience of the studied system, and therefore reveal patterns already inherently
present in the data.

If at all possible, missing data, for example, gaps in column values, should be avoided.
However, if necessary, such missing data may be replaced or "filled" with, for example, the
mean value of a column ("mean fill"); a random value ("random fill"); or a value based on
a principal component analysis ("principal component fill"). Each of these different
approaches will have a different effect on subsequent PR analysis.
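
Of the three approaches, "mean fill" is the simplest and may be sketched as below. The function name `mean_fill` and the use of NaN to mark a missing entry are assumptions made for illustration.

```python
import numpy as np

def mean_fill(X):
    """Replace missing entries (NaN) with the mean of their column."""
    X = np.array(X, dtype=float)          # work on a copy
    col_means = np.nanmean(X, axis=0)     # column means ignoring gaps
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
filled = mean_fill(X)
# The gap in column 0 becomes 2.0; the gap in column 1 becomes 3.0.
```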

"Translation" of the descriptor coordinate axes can be useful. Examples of such
translation include normalisation and mean centring.

"Normalisation" may be used to remove sample-to-sample variation. Many normalisation
approaches are possible, and they can often be applied at any of several points in the
analysis. Usually, normalisation is applied after redundant spectral regions have been
removed. In one approach, each spectrum is normalised (scaled) by a factor of 1/A,
where A is the sum of the absolute values of all of the descriptors for that spectrum. In
this way, each data vector has the same length, specifically, 1. For example, if the sum
of the absolute values of intensities for each bucket in a particular spectrum is 1067, then
the intensity for each bucket for this particular spectrum is scaled by 1/1067.
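
A brief sketch of this normalisation, using a first row chosen (as an assumption, for illustration) so that its absolute intensities sum to 1067 as in the example above:

```python
import numpy as np

def normalise_rows(X):
    """Scale each spectrum (row) by 1/A, A = sum of its absolute values."""
    X = np.asarray(X, dtype=float)
    totals = np.abs(X).sum(axis=1, keepdims=True)
    return X / totals

X = np.array([[1000.0, 60.0, 7.0],    # absolute sum = 1067
              [2.0, 2.0, 4.0]])       # absolute sum = 8
X_norm = normalise_rows(X)
# Every row of X_norm now has absolute values summing to 1.
```

Here "length 1" is understood in the sense of the sum of absolute values, as defined in the text.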

"Mean centring" may be used to simplify interpretation. Usually, for each descriptor, the 
average value of that descriptor for all samples is subtracted. In this way, the mean of a
descriptor coincides with the origin, and all descriptors are "centred" at zero. For
example, if the average intensity at δ 10.0-9.96, for all spectra, is 1.2 units, then the
intensity at δ 10.0-9.96, for all spectra, is reduced by 1.2 units.

In "unit variance scaling," data can be scaled to equal variance. Usually, the value of
each descriptor is scaled by 1/StDev, where StDev is the standard deviation for that
descriptor for all samples. For example, if the standard deviation at δ 10.0-9.96, for all
spectra, is 2.5 units, then the intensity at δ 10.0-9.96, for all spectra, is scaled by 1/2.5 or
0.4. Unit variance scaling may be used to reduce the impact of "noisy" data. For
example, some metabolites in biofluids show a strong degree of physiological variation
(e.g., diurnal variation, dietary-related variation) that is unrelated to any
pathophysiological process. Without unit variance scaling, these noisy metabolites may
dominate subsequent analysis.

"Pareto scaling" is, in some sense, intermediate between mean centering and unit
variance scaling. In effect, smaller peaks in the spectra can influence the model to a
higher degree than for the mean centered case. Also, the loadings are, in general, more
interpretable than for unit variance based models. In pareto scaling, the value of each
descriptor is scaled by 1/sqrt(StDev), where StDev is the standard deviation for that
descriptor for all samples. In this way, each descriptor has a variance numerically equal
to its initial standard deviation. The pareto scaling may be performed, for example, on
raw data or mean centered data.
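
Mean centring, unit variance scaling and pareto scaling differ only in the divisor applied to each descriptor, which a short sketch can make explicit. The function name `scale_columns`, the choice to centre before scaling in every case, the use of the sample standard deviation (`ddof=1`), and the random test data are all assumptions made for illustration.

```python
import numpy as np

def scale_columns(X, method="uv"):
    """Column-wise pre-processing: 'centre', 'uv' or 'pareto'."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)          # mean centring
    sd = X.std(axis=0, ddof=1)            # StDev of each descriptor
    if method == "centre":
        return centred
    if method == "uv":
        return centred / sd               # scaled by 1/StDev
    if method == "pareto":
        return centred / np.sqrt(sd)      # scaled by 1/sqrt(StDev)
    raise ValueError(f"unknown method: {method}")

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(20, 3))

X_uv = scale_columns(X, "uv")
X_par = scale_columns(X, "pareto")
# After pareto scaling, each column's variance equals its original StDev.
```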

"Logarithmic scaling" may be used to assist interpretation when data have a positive
skew and/or when data spans a large range, e.g., several orders of magnitude. Usually,
for each descriptor, the value is replaced by the logarithm of that value. For example,
the intensity at δ 10.0-9.96 is replaced by the logarithm of the intensity at δ 10.0-9.96, for
all spectra.

In "equal range scaling," each descriptor is divided by the range of that descriptor for all
samples. In this way, all descriptors have the same range, that is, 1. For example, if, at
δ 10.0-9.96, for all spectra, the largest value is 87 units and the smallest value is 1, then
the range is 86 units, and the intensity at δ 10.0-9.96, for all spectra, is divided by 86
units. However, this method is sensitive to the presence of outlier points.

In "autoscaling," each data vector is mean centred and unit variance scaled. This
technique is very useful because each descriptor is then weighted equally and, in the
case of NMR descriptors, large and small peaks are treated with equal emphasis. This
can be important for metabolites present at very low, but still detectable, levels.

Several supervised methods of scaling data are also known. Some of these can provide
a measure of the ability of a parameter (e.g., a descriptor) to discriminate between
classes, and can be used to improve classification by stretching a separation.

For example, in "variance weighting," the variance weight of a single parameter (e.g., a
descriptor) is calculated as the ratio of the inter-class variances to the sum of the
intra-class variances. A large value means that this variable is discriminating between the
classes. For example, if the samples are known to fall into two classes (e.g., a training
set), it is possible to examine the mean and variance of each descriptor. If a descriptor
has very different mean values and a small variance, then it will be good at separating
the classes.
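
One possible reading of variance weighting is sketched below. It is an assumption of this sketch, not a definition taken from the text, that the inter-class variance is the variance of the class means and that the intra-class term is the sum of the within-class variances; the function name `variance_weights` and the small data set are likewise illustrative.

```python
import numpy as np

def variance_weights(X, labels):
    """Per-descriptor ratio: variance of class means / sum of within-class variances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    class_means = np.array([X[labels == c].mean(axis=0) for c in classes])
    inter = class_means.var(axis=0, ddof=1)
    intra = sum(X[labels == c].var(axis=0, ddof=1) for c in classes)
    return inter / intra

# Descriptor 0 separates the two classes; descriptor 1 does not.
X = np.array([[1.0, 5.0], [1.2, 1.0], [0.8, 3.0],
              [5.0, 4.0], [5.2, 2.0], [4.8, 3.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
weights = variance_weights(X, labels)
```

A descriptor with well separated class means and small within-class spread receives a large weight, which is the behaviour described above.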

"Feature weighting" is a more general description of variance weighting, where not only 
the mean and standard deviation of each descriptor is calculated, but other well known 
weighting factors, such as the Fisher weight, are used. 

Multivariate Statistical Analysis

As discussed above, multivariate statistics analysis methods, including pattern 
recognition methods, are often the most convenient and efficient way to analyse complex 
data, such as NMR spectra. 

For example, such analysis methods may be used to identify, for example, diagnostic
spectral windows and/or diagnostic species, for a particular condition under study.

Also, such analysis methods may be used to form a predictive model, and then use that
model to classify test data. For example, one convenient and particularly effective
method of classification employs multivariate statistical analysis modelling, first to form a
model (a "predictive mathematical model") using data ("modelling data") from samples of
known class (e.g., from subjects known to have, or not have, a particular condition), and
second to classify an unknown sample (e.g., "test data"), as having, or not having, that
condition.

Examples of pattern recognition methods include, but are not limited to, Principal
Component Analysis (PCA) and Partial Least Squares-Discriminant Analysis (PLS-DA).

PCA is a bilinear decomposition method used for overviewing "clusters" within
multivariate data. The data are represented in K-dimensional space (where K is equal to
the number of variables) and reduced to a few principal components (or latent variables)
which describe the maximum variation within the data, independent of any knowledge of
class membership (i.e., "unsupervised"). The principal components are displayed as a
set of "scores" (t) which highlight clustering, trends, or outliers, and a set of "loadings" (p)
which highlight the influence of input variables on t (see, for example, Kowalski et al.,
1986).

The PCA decomposition can be described by the following equation: 

X = TP' + E 

where T is the set of scores explaining the systematic variation between the
observations in X, and P is the set of loadings explaining the between-variable variation,
which provides the explanation for clusters, trends, and outliers in the score space. The
non-systematic part of the variation not explained by the model forms the residuals, E.
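
The decomposition X = TP' + E can be illustrated with a short sketch. It is an assumption of this sketch that X is mean centred first and that the components are obtained from a singular value decomposition rather than by the iterative algorithm used in commercial packages; the function name `pca` and the random data are illustrative only.

```python
import numpy as np

def pca(X, n_components):
    """Return scores T, loadings P and residuals E such that Xc = T P' + E."""
    Xc = X - X.mean(axis=0)                       # mean centring
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]    # scores (one column per component)
    P = Vt[:n_components].T                       # loadings (one column per component)
    E = Xc - T @ P.T                              # residuals not explained by the model
    return T, P, E

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
T, P, E = pca(X, n_components=2)
```

The loadings are orthonormal and the residuals are orthogonal to the scores, so the retained components and E partition the variation in the centred data.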

PLS-DA is a supervised multivariate method yielding latent variables describing
maximum separation between known classes of samples. PLS-DA is based on PLS,
which is the regression extension of the PCA method explained earlier. Whereas PCA
works to explain maximum variation between the studied samples, PLS-DA suffices to
explain maximum separation between known classes of samples in the data (X). This is
done by a PLS regression against a "dummy vector or matrix" (Y) carrying the class
separating information. The calculated PLS components will thereby be more focused
on describing the variation separating the classes in X, if this information is present in the
data. From an interpretation point of view all the features of PLS can be used, which
means that the variation can be interpreted in terms of scores (t, u), loadings (p, c), PLS
weights (w) and regression coefficients (b). The fact that a regression is carried out
against a known class separation means that PLS-DA is a supervised method and
that the class membership has to be known prior to the actual modelling. Once a model
is calculated and validated it can be used for prediction of class membership for "new"
unknown samples. Judgement of class membership is done on the basis of predicted class
membership (Ypred), predicted scores (tpred) and predicted residuals (DmodXpred)
using statistical significance limits for the decision. See, for example, Sjostrom et al.,
1986; Ståhle et al., 1987.

In PLS, the variation between the objects in X is described by the X-scores, T, and the
variation in the Y-block regressed against is described in the Y-scores, U. In PLS-DA
the Y-block is a "dummy vector or matrix" describing the class membership of each
observation. Basically, what PLS does is to maximize the covariance between T and U.
For each component, a PLS weight vector, w, is calculated, containing the influence of
each X-variable on the explanation of the variation in Y. Together the weight vectors will
form a matrix, W, containing the variation in X that maximizes the covariance between
the scores T and U for each calculated component. For PLS-DA this means that the
weights, W, contain the variation in X that is correlated to the class separation described
in Y. The Y-block matrix of weights is designated C. A matrix of X-loadings, P, is also
calculated. Apart from their use in interpretation, these loadings are used to perform the
proper decomposition of X.

The PLS decomposition of X and Y can hence be described as follows:

X = TP' + E

Y = TC' + F

The PLS regression coefficients, B, are then given by:

B = W(P'W)⁻¹C'

The estimate of Y, Yhat, can then be calculated according to the following formula:

Yhat = XW(P'W)⁻¹C' = XB
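
As a hedged illustration of how W, P and C give the regression coefficients B = W(P'W)⁻¹C', the following sketch implements a simple single-response (PLS1) NIPALS-style algorithm with a two-class dummy vector as Y. The function name `pls1`, the simulated data, and the restriction to a single Y column are assumptions of this sketch and do not reproduce any particular software package.

```python
import numpy as np

def pls1(X, y, n_components):
    """Single-response PLS; returns B, W, P and the Y-weights c."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    E, f = Xc.copy(), yc.copy()
    W, P, c = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)        # weight vector maximising covariance with y
        t = E @ w                     # X-scores
        tt = t @ t
        p = E.T @ t / tt              # X-loadings
        ca = f @ t / tt               # Y-weight for this component
        E = E - np.outer(t, p)        # deflate X
        f = f - ca * t                # deflate y
        W.append(w); P.append(p); c.append(ca)
    W, P, c = np.column_stack(W), np.column_stack(P), np.array(c)
    B = W @ np.linalg.solve(P.T @ W, c)   # B = W (P'W)^-1 C'
    return B, W, P, c

rng = np.random.default_rng(0)
labels = np.repeat([0.0, 1.0], 15)        # dummy vector: class membership
X = rng.normal(size=(30, 3))
X[labels == 1, 0] += 2.0                  # class difference in descriptor 0

B, W, P, c = pls1(X, labels, n_components=3)
Xc = X - X.mean(axis=0)
y_hat = Xc @ B + labels.mean()            # Yhat = XB (on centred data)
```

With as many components as X has independent columns, the PLS coefficients coincide with ordinary least squares on the centred data, which provides a convenient check of the formula; in practice far fewer components are retained.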

Both of the pattern recognition algorithms exemplified herein (PCA, PLS-DA) rely on
extraction of linear associations between the input variables. When such linear
relationships are insufficient, neural network-based pattern recognition techniques can in
some cases improve the ability to classify individuals on the basis of the many
inter-related input variables (see, e.g., Ala-Korpela et al., 1995; Hiltunen et al., 1995).
Nevertheless, the methods applied herein are sufficiently powerful to allow classification
of the individuals studied, and they provide an additional benefit over neural network
methods in that they allow some information to be gained as to what aspects of the input
dataset were particularly important in allowing classification to be made.



Spurious or irregular data in spectra ("outliers"), which are not representative, are
preferably identified and removed. Common reasons for irregular data ("outliers")
include spectral artefacts such as poor phase correction, poor baseline correction, poor
chemical shift referencing, poor water suppression, and biological effects such as
bacterial contamination, shifts in the pH of the biofluid, toxin- or disease-induced
biochemical response, and other conditions, e.g., pathological conditions, which have
metabolic consequences, e.g., diabetes.



Outliers are identified in different ways depending on the method of analysis used. For
example, when using principal component analysis (PCA), small numbers of samples
lying far from the rest of the replicate group can be identified by eye as outliers. A more
objective means of identification for PCA is to use the Hotelling's T² test, which is the
multivariate version of the well known Student's t test used in univariate statistics. For
any given sample, the T² value can be calculated and this is compared with a standard
value within which a chosen fraction (e.g., 95%) of the samples would normally lie.
Samples with T² values substantially outside this limit can then be flagged as outliers.
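
A minimal sketch of the per-sample T² calculation on PCA scores is given below. It assumes mean-centred scores and computes only the T² values themselves; the critical value for a chosen confidence level (conventionally derived from the F distribution) is not computed here. The function name `hotelling_t2` and the simulated data with one deliberately shifted sample are assumptions for illustration.

```python
import numpy as np

def hotelling_t2(T):
    """Hotelling's T2 per sample: sum over components of t^2 / var(t)."""
    T = np.asarray(T, dtype=float)
    score_var = T.var(axis=0, ddof=1)
    return (T ** 2 / score_var).sum(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
X[0] += 20.0                                   # one grossly deviating sample

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :2] * s[:2]                           # scores on two components

t2 = hotelling_t2(T)
suspect = int(np.argmax(t2))                   # sample with the largest T2
```

Samples whose T² exceeds the chosen limit would then be examined further rather than removed automatically.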

Also, when using more sophisticated supervised methods, such as SIMCA or PNNs, a
similar method is used. A confidence level (e.g., 95%) is selected and the region of
multivariate space corresponding to confidence values above this limit is determined.
This region can be displayed graphically in several different ways (for example by
plotting the critical T² ellipse on a PCA scores plot). Any samples falling outside the high
confidence region are flagged as potential outliers.

Confidence limits for outlier detection are also calculated in the residual direction
expressed as the distance to model in X (DModX).

Briefly, DModX is the perpendicular distance of an object to the principal component (or
to the plane or hyperplane made up by two or more principal components). In the
SIMCA software, DModX is calculated as:

DModX = v * sqrt(e²/(K-A))

wherein e is the residual for a single observation;
K is the number of original variables in the data set;
A is the number of principal components in the model;
v is a correction factor, based on the number of observations (N) and the number of
principal components (A), and is slightly larger than one.
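
The distance-to-model calculation may be sketched as follows. Two assumptions are made for this illustration: e² is read as the sum of squared residuals over the K variables for one observation, and the correction factor v is left as a user-supplied parameter (defaulting to 1) because its exact form is not given here. The function name `dmodx` and the toy residual matrix are hypothetical.

```python
import numpy as np

def dmodx(E, n_components, v=1.0):
    """Distance to model in X for each observation (row of residuals E)."""
    E = np.asarray(E, dtype=float)
    K = E.shape[1]                               # number of original variables
    sum_sq = (E ** 2).sum(axis=1)                # squared residual per observation
    return v * np.sqrt(sum_sq / (K - n_components))

# Toy residuals for two observations, K = 5 variables, A = 1 component.
E = np.array([[3.0, 4.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0]])
d = dmodx(E, n_components=1)
# First observation: sqrt(25 / 4) = 2.5; second lies exactly on the model.
```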

The outliers in this direction are not as severe as those occurring in the score direction
but should always be carefully examined before making a decision whether to include
them in the modelling or not. In general, all outliers are thoroughly investigated, for
example, by examining the contributing loadings and distance to model (DModX) as well
as visually inspecting the original NMR spectrum for deviating features, before removing
them from the model. Outlier detection by automatic algorithm is a possibility using the
features of scores and residual distance to model (DModX) described above.

When using PLS methods, the distance to the model in Y (DmodY) can also be 
calculated in the same way. 

Data Filtering

Although pattern recognition methods may be applied to "unfiltered" data, it is often
preferable to first filter data to remove irrelevant variation.

In one method, latent variables which are of no interest may be removed by "filtering."

Examples of filtering methods include the regression of descriptor variables against an
index based on sample class to eliminate variables with low correlation to the predefined
classes. Related methods include target rotation (see, e.g., Kvalheim et al., 1989) and
PCT filtering (see, e.g., Sun, 1997). In these methods, the removed variation is not
necessarily completely uncorrelated with sample class (i.e., orthogonal).

In another method, latent variables which are orthogonal to some variation or class index
of interest are removed by "orthogonal filtering." Here, variation in the data which is not
correlated to (i.e., is orthogonal to) the class separating variation of interest may be
removed. Such methods are, in general, more efficient than non-orthogonal filtering
methods.

Various orthogonal filtering methods have been described (see, e.g., Wold et al., 1998a;
Fearn, 2000; Anderson, 1999; Westerhuis et al., 2001; Wise et al., 2001).

One preferred orthogonal filtering method is conventionally referred to as Orthogonal
Signal Correction (OSC), wherein latent variables orthogonal to the variation of interest
are removed. See, for example, Wold et al., 1998a.

The class identity is used as a response vector, Y, to describe the variation between the
sample classes. The OSC method then locates the longest vector describing the
variation between the samples which is not correlated with the Y-vector, and removes it
from the data matrix. The resultant dataset has been filtered to allow pattern recognition
focused on the variation correlated to features of interest within the sample population,
rather than non-correlated, orthogonal variation.

OSC is a method for spectral filtering that solves the problem of unwanted systematic
variation in the spectra by removing components (latent variables) orthogonal to the
response calibrated against. In PLS, the weights, w, are calculated to maximise the
covariance between X and Y. In OSC, in contrast, the weights, w, are calculated to
minimize the covariance between X and Y, which is the same as calculating components
as close to orthogonal to Y as possible. These components, orthogonal to Y, containing
unwanted systematic variation, are then subtracted from the spectral data, X, to produce
a filtered predictor matrix describing the variation of interest. Briefly, OSC can be
described as a bilinear decomposition of the spectral matrix, X, into a set of scores, T**,
and a set of corresponding loadings, P**, containing variation orthogonal to the response,
Y. The unexplained part, or the residuals, E, is equal to the filtered X-matrix, Xosc,
containing less unwanted variation. The decomposition is described by the following
equations:

X = T** P**' + E
Xosc = E

The OSC procedure starts by calculation of the first latent variable or principal
component describing the variation in the data, X. The calculation is done according to
the NIPALS algorithm.

X = tp' + E 

The first score vector, t, which is a summary of the between-sample variation in X, is
then orthogonalized against the response (Y), giving the orthogonalized score vector t*.

t* = (I - Y(Y'Y)^-1 Y')t

After orthogonalization, the PLS weights, w, are calculated with the aim of making Xw =
t*. By doing this, the weights, w, are set to minimise the covariance between X and Y.
The weights, w, are given by:

An estimate of the orthogonal score t** is calculated from:

t** = Xw

The estimate or updated score vector t** is then again orthogonalized to Y, and the
iteration proceeds until t** has converged. This will ensure that t** will converge towards
the longest vector orthogonal to response Y, still giving a good description of the
variation in X. The data, X, can then be described as the score, t**, orthogonal to Y,
times the corresponding loading vector, p**, plus the unexplained part, the residual, E.

X = t**p**' + E

The residual, E, equals the filtered X, Xosc, after subtraction of the first component
orthogonal to the response Y.

E = X - t**p**'
Xosc = E

If more than one component needs to be removed, the same procedure is repeated 
using the residual, E, as the starting data matrix, X. 

New external data not present in the model calculation must be treated in the same way
as the modelling data were filtered. This is done by using the calculated weights, w, from
the filtering to calculate a score vector, tnew, for the new data, Xnew.

tnew = Xnew w

By subtracting tnew times the loading vector from the calibration, p**, from the new
external data, Xnew, the residual, Enew, will be the resulting OSC filtered matrix for the
new external data.
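
The OSC steps described above can be sketched in code as follows. This is a minimal
illustration only, not the implementation used in the examples. It assumes mean-centred
data and a mean-centred class vector, uses an SVD in place of NIPALS for the starting
score, and takes the weights, w, as the least-squares solution of Xw = t* (via a
pseudo-inverse). The function names osc_fit and osc_apply, and the random data, are
hypothetical.

```python
import numpy as np


def osc_fit(X, y, max_iter=100, tol=1e-10):
    """Remove one component orthogonal to y from mean-centred X.

    Returns the filtered matrix plus the weights w and loading p
    needed to filter new data in the same way.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(y, dtype=float).reshape(X.shape[0], -1)

    # Starting score: first principal component of X
    # (SVD is used here in place of NIPALS; an assumption of this sketch).
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    t = U[:, 0] * s[0]

    # Assumption: w is taken as the least-squares solution of Xw = t*.
    X_pinv = np.linalg.pinv(X, rcond=1e-10)

    for _ in range(max_iter):
        # t* = (I - Y(Y'Y)^-1 Y') t : orthogonalise the score against Y
        t_star = t - Y @ np.linalg.lstsq(Y, t, rcond=None)[0]
        w = X_pinv @ t_star          # weights chosen so that Xw is close to t*
        t_new = X @ w                # t** = Xw
        done = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
        t = t_new
        if done:
            break

    p = X.T @ t / (t @ t)            # loading p** for the orthogonal score
    X_osc = X - np.outer(t, p)       # E = X - t** p**'
    return X_osc, w, p, t


def osc_apply(X_new, w, p):
    """Filter new data with the stored weights and loading."""
    X_new = np.asarray(X_new, dtype=float)
    t_new = X_new @ w                # tnew = Xnew w
    return X_new - np.outer(t_new, p)


# Hypothetical data: 20 samples, 50 spectral buckets, two classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
X -= X.mean(axis=0)                  # mean centre each variable (bucket)
y = np.repeat([0.0, 1.0], 10)
y -= y.mean()

X_osc, w, p, t = osc_fit(X, y)
```

To remove a second orthogonal component, osc_fit would be called again on X_osc, and
new data would be passed through osc_apply once per stored (w, p) pair, in the same
order.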

If PCA suggests separation between the classes under investigation, orthogonal signal
correction (OSC) can be used to optimize the separation, thus improving the
performance of subsequent multivariate pattern recognition analysis and enhancing the
predictive power of the model. In the examples described herein, both PCA and PLS-DA
analyses were improved by prior application of OSC.

An example of a typical OSC process includes the following steps: 

(a) NMR data are segmented using AMIX, normalised, and optionally scaled 
and/or mean centered. The default for orthogonal filtering of spectral data is to use only 
mean centered data, which means that the mean for each variable (spectral bucket) is 
subtracted from each single variable in the data matrix. 

(b) a response vector (y) describing the class separating variation is created by
assigning class membership to each sample.

(c) one latent variable orthogonal to the response vector (y) is removed according 
to the OSC algorithm. 

(d) if desired, the removed orthogonal variation can be viewed and interpreted in 
terms of scores (T) and loadings (P). 

(e) the filtered data matrix, which contains less variation not correlated to class 
separation, is next used for further multivariate modelling after optional scaling and/or 
mean centering. 

Any particular model is only as good as the data used to formulate it. Therefore, it is
preferable that all modelling data and test data are obtained under the same (or similar)
conditions and using the same (or similar) experimental parameters. Such conditions
and parameters include, for example, sample type (e.g., plasma, serum), sample
collection and handling protocol, sample dilution, NMR analysis (e.g., type, field
strength/frequency, temperature), and data processing (e.g., referencing, baseline
correction, normalisation). If appropriate, it may be desirable to formulate models for a
particular subgroup of cases, e.g., according to any of the parameters mentioned above
(e.g., field strength/frequency), or others, such as sex, age, ethnicity, medical history,
lifestyle (e.g., smoker, nonsmoker), hormonal status (e.g., pre-menopausal,
post-menopausal).

In general, the quality of the model improves as the amount of modelling data increases.
Nonetheless, as shown in the examples below, even relatively small sets of modelling
data (e.g., about 50-100 subjects) are sufficient to achieve a confident classification (e.g.,

A typical unsupervised modelling process includes the following steps:

(a) optionally scaling and/or mean centering modelling data; 

(b) classifying data (e.g., as control or positive, e.g., diseased); 

(c) fitting the model (e.g., using PCA, PLS-DA); 
(d) identifying and removing outliers, if any;

(e) re-fitting the model; 

(f) optionally repeating (c), (d), and (e) as necessary. 

Optionally (and preferably), data filtering is performed following step (d) and before
step (e). Optionally (and preferably), orthogonal filtering (e.g., OSC) is performed
following step (d) and before step (e).

An example of a typical PLS-DA modelling process, using OSC filtered data, includes the 
following steps: 

(a) OSC filtered data is optionally scaled and/or mean centered.

(b) a response vector (y) describing the class separating variation is created by
assigning class membership to all samples.

(c) a PLS regression model is calculated between the OSC filtered data and the
response vector (y). The calculated latent variables or PLS components will be focused
on describing maximum separation between the known classes.

(d) the model is interpreted by viewing scores (T), loadings (P), PLS weights (W),
PLS coefficients (B) and residuals (E). Together they will function as a means for
describing the separation between the classes as well as provide an explanation of the
observed separation.

Once the model has been calculated, it may be verified using data for samples of known
class which were not used to calculate the model. In this way, the ability of the model to
accurately predict classes may be tested. This may be achieved, for example, in the
method above, with the following additional step:

(e) a set of external samples, with known class membership, which were not used in
the (e.g., PLS) model calculation is used for validation of the model's predictive ability.
The prediction results are investigated, for example, in terms of predicted response
(ypred), predicted scores (Tpred), and predicted residuals described as predicted distance
to model (DModXpred).

The model may then be used to classify test data, of unknown class. Before
classification, the test data are numerically pre-processed in the same manner as the
modelling data.
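
The PLS-DA modelling and external validation steps above can be sketched as follows.
This is an illustrative sketch under stated assumptions, not the software used in the
examples: it implements a single-response PLS (PLS1) with the NIPALS algorithm, codes
the two classes as 1 and 2, classifies with a cut-off of 1.5, and uses a simple
root-mean-square row residual as a stand-in for DModX. The function names and the
simulated data are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components=2):
    """Fit a PLS1 model (single response) with the NIPALS algorithm."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # mean-centred X and y
    W, P, T, q = [], [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)             # PLS weights (unit length)
        t = E @ w                          # scores
        p = E.T @ t / (t @ t)              # loadings
        c = (f @ t) / (t @ t)              # regression of y on the scores
        E = E - np.outer(t, p)             # deflate X
        f = f - c * t                      # deflate y
        W.append(w); P.append(p); T.append(t); q.append(c)
    W, P, T = np.column_stack(W), np.column_stack(P), np.column_stack(T)
    q = np.array(q)
    R = W @ np.linalg.inv(P.T @ W)         # maps centred X directly to scores
    return {"x_mean": x_mean, "y_mean": y_mean, "W": W, "P": P, "T": T,
            "q": q, "R": R, "b": R @ q}


def pls1_predict(model, X_new):
    """Return predicted response, scores and a residual distance per sample."""
    Xc = np.asarray(X_new, dtype=float) - model["x_mean"]
    T_pred = Xc @ model["R"]
    y_pred = Xc @ model["b"] + model["y_mean"]
    resid = Xc - T_pred @ model["P"].T
    # Root-mean-square residual per row; a simplified stand-in for DModX.
    dist = np.sqrt((resid ** 2).mean(axis=1))
    return y_pred, T_pred, dist


def make_data(rng, n_per_class, n_vars=30):
    """Hypothetical two-class data: class 2 is elevated in five buckets."""
    X = rng.normal(size=(2 * n_per_class, n_vars))
    y = np.repeat([1.0, 2.0], n_per_class)   # dummy response: 1 or 2
    X[y == 2.0, :5] += 5.0
    return X, y


rng = np.random.default_rng(0)
X_train, y_train = make_data(rng, 20)        # modelling data
X_test, y_test = make_data(rng, 10)          # external samples, known class

model = pls1_fit(X_train, y_train, n_components=2)
y_pred, T_pred, dist = pls1_predict(model, X_test)
predicted_class = np.where(y_pred > 1.5, 2.0, 1.0)
accuracy = float((predicted_class == y_test).mean())
```

Test data of unknown class would be passed through pls1_predict in the same way, after
the same numerical pre-processing as the modelling data.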

Interpreting the output from the pattern recognition (PR) analysis provides useful
information on the biomarkers responsible for the separation of the biological classes.
Of course, the PR output differs somewhat depending on the data analysis method used.
As mentioned above, methods for PR and interpretation of the results are known in the
art. Interpretation methods for two PR techniques (PCA and PLS-DA) are discussed
briefly herein.

Interpreting PCA Results

The data matrix (X) is built up by N observations (samples, rats, patients, etc.) and K
variables (spectral buckets carrying the biomarker information in terms of ¹H-NMR
resonances).

In PCA, the N*K matrix (X) is decomposed into a few latent variables or principal
components (PCs) describing the systematic variation in the data. Since PCA is a
bilinear decomposition method, each PC can be divided into two vectors, scores (t) and
loadings (p). The scores can be described as the projection of each observation on to
each PC and the loadings as the contribution of each variable (spectral bucket) to the PC
expressed in terms of direction.

Any clustering of observations (samples) along a direction found in scores plots (e.g.,
PC1 versus PC2) can be explained by identifying which variables (spectral buckets)
have high loadings for this particular direction in the scores. A high loading is defined as
a variable (spectral bucket) that changes between the observations in a systematic way
showing a trend which matches the sample positions in the scores plot. Each spectral
bucket with a high loading, or a combination thereof, is defined by its NMR chemical
shift position; this is its diagnostic spectral window. These chemical shift values then
allow the skilled NMR spectroscopist to examine the original NMR spectra and identify
the molecules giving rise to the peaks in the relevant buckets; these are the biomarkers.
This is typically done using a combination of standard 1- and 2-dimensional NMR
methods.

If, in a scores plot, separation of two classes of sample can be seen in a particular
direction, then examination of those loadings which are in the same direction as in the
scores plots indicates which loadings are important for the class identification. The
loadings plot shows points which are labelled according to the bucket chemical shift.
This is the NMR spectroscopic chemical shift which corresponds to the centre of the
bucket. This bucket defines a diagnostic spectral window. Given a list of these bucket
identifiers, the skilled NMR spectroscopist then re-examines the NMR spectra and
identifies, within the bucket width, which of several possible NMR resonances are
changed between the two classes. The important resonance is characterised in terms of
exact chemical shift, intensity, and peak multiplicity. Using other NMR experiments,
such as 2-D NMR spectroscopy and/or separation of the specific molecule using
HPLC-NMR-MS for example, other resonances from the same molecule are identified
and ultimately, on the basis of all of the NMR data and other data if appropriate, an
identification of the molecule (biomarker) is made.
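
The link between scores, loadings and bucket chemical shifts described above can be
illustrated with a short sketch. It is a hedged example only: PCA is computed by SVD
of mean-centred data rather than by NIPALS, and the bucket centres, class labels and
intensities are simulated, with one bucket (centred at 1.30) deliberately made to
differ between the two classes.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA by SVD of the mean-centred data; returns scores T and loadings P."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # one row per observation
    P = Vt[:n_components].T                      # one row per spectral bucket
    return T, P


# Hypothetical bucketed data: 30 samples, 100 buckets of width 0.04.
rng = np.random.default_rng(1)
centres = np.round(0.22 + 0.04 * np.arange(100), 2)   # bucket centres
classes = np.repeat([0, 1], 15)
X = rng.normal(scale=0.1, size=(30, 100))
X[classes == 1, 27] += 2.0        # class 1 is elevated in the 1.30 bucket

T, P = pca(X, n_components=2)

# Rank buckets by the size of their loading on PC1 and report the top one.
order = np.argsort(-np.abs(P[:, 0]))
top_bucket = centres[order[0]]

# Separation of the two classes along PC1 in the scores.
gap = abs(T[classes == 1, 0].mean() - T[classes == 0, 0].mean())
```

Here the bucket with the largest PC1 loading is the one that was altered, and its
chemical shift would be the starting point for re-examining the original spectra.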

In a classification situation as described herein, one procedure for finding relevant
biomarkers using PCA is as follows:

(a) PCA of the data matrix (X) containing N observations belonging to either of two
known classes (healthy or diseased). The description of the observations lies in the K
variables (spectral buckets) containing the biomarker information in terms of NMR
resonances.

(b) Interpretation of the scores (t) to find the direction for the separation between the two
known classes in X.

(c) Interpretation of loadings (p) reveals which variables (spectral buckets) have the
largest impact on the direction for separation described in the scores (t). This identifies
the relevant diagnostic spectral windows.

(d) Assignment of the spectral buckets or combinations thereof to certain biomarkers.
This is done, for example, by interpretation of the resonances in NMR spectra and by
using previously assigned spectra of the same type as a library for assignments.

Interpreting PLS-DA Results



In PLS-DA, which is a regression extension of the PCA method, the options for
interpretation are more extensive compared to the PCA case. PLS-DA performs a
regression between the data matrix (X) and a "dummy matrix" (Y) containing the class
membership information (e.g., samples may be assigned the value 1 for healthy and 2
for diseased classes). The calculated PLS components will describe the maximum
covariance between X and Y, which in this case is the same as maximum separation
between the known classes in X. The interpretation of scores (t) and loadings (p) is the
same in PLS-DA as in PCA. Interpretation of the PLS weights (w) for each component
provides an explanation of the variables in X correlated to the variation in Y. This will
give biomarker information for the separation between the classes.

Since PLS-DA is a regression method, the features of regression coefficients (b) can
also be used for discovery and interpretation of biomarkers. The regression coefficients
(b) in PLS-DA provide a summary of which variables in X (spectral buckets) are
most important in terms of both describing variation in X and correlating to Y. This means
that variables (spectral buckets) with high regression coefficients are important for
separating the known classes in X since the Y matrix against which it is correlated only
contains information on the class identity of each sample.

Again, as discussed above, the scores plot is examined to identify important loadings,
diagnostic spectral windows, relevant NMR resonances, and ultimately the associated
biomarkers.

In a classification situation as described herein, one procedure for finding relevant
biomarkers using PLS-DA is as follows:

(a) A PLS model between the N*K data matrix (X) and a "dummy matrix" (Y),
containing information on class membership for the observations in X, is calculated,
yielding a few latent variables (PLS components) describing maximum separation
between the two classes in X (e.g., healthy and diseased).

(b) Interpretation of the scores (t) to find the direction for the separation between the two
known classes in X.

(c) Interpretation of loadings (p) revealing which variables (spectral buckets) have the
largest impact on the direction for separation described in the scores (t); these are
diagnostic spectral windows.

In PLS-DA, a variable importance plot (VIP) is another method of evaluating the
significance of loadings in causing a separation of class of sample in a scores plot.
Typically, the VIP is a squared function of PLS weights, and therefore only positive
numerical values are encountered; in addition, for a given model, there is only one set of
VIP-values. Variables with a VIP value of greater than 1 are considered most influential
for the model. The VIP shows each loading in a decreasing order of importance for class
separation based on the PLS regression against class variable.

A (w*c) plot is another diagnostic plot obtained from a PLS-DA analysis. It shows which
descriptors are mainly responsible for class separation. The (w*c) parameters are an
attempt to describe the total variable correlations in the model, i.e., between the
descriptors (e.g., NMR intensities in buckets), between the NMR descriptors and the
class variables, and between class variables if they exist (in the present two class case,
where samples are assigned by definition to class 1 and class 2, there is no correlation).
Thus for a situation in a scores plot (e.g., t1 vs. t2), if class 1 samples are clustered in
the upper right hand quadrant and class 2 samples are clustered in the lower left hand
quadrant, then the (w*c) plot will show descriptors also in these quadrants. Descriptors in
the upper right hand quadrant are increased in class 1 compared to class 2 and vice
versa for the lower left hand quadrant.

(d) Interpretation of PLS weights (w) reveals which variables (spectral buckets) in X are
important for correlation to Y (class separation); these, too, are diagnostic spectral
windows.

(e) Interpretation of the PLS regression coefficients (b) reveals an overall summary of
which variables (spectral buckets) have the largest impact on the direction for separation
described in the scores; these, too, are diagnostic spectral windows.

In a typical regression coefficient plot for NMR, each bar represents a spectral region
(e.g., 0.04 ppm) and shows how the NMR profile of one class of samples differs from
the NMR profile of a second class of samples. A positive value on the x-axis
indicates there is a relatively greater concentration of metabolite (assigned using NMR
chemical shift assignment tables) in one class as compared to the other class, and a
negative value on the x-axis indicates a relatively lower concentration in one class as
compared to the other class.

(f) Assignment of the spectral buckets or combinations thereof to certain biomarkers.
This is done, for example, by interpretation of the resonances in NMR spectra and by
using previously assigned spectra of the same type as a library for assignments.
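
The variable importance values and regression coefficients used in the procedure above
can be computed from a fitted PLS model as sketched below. This is an illustration
under stated assumptions: the VIP expression used is one commonly quoted form,
VIP_k = sqrt(K * sum_a(SSY_a * w_ak^2) / sum_a(SSY_a)), with unit-length weight
vectors, and the specification itself does not give a formula. The minimal PLS1 routine
and the simulated two-class data are hypothetical.

```python
import numpy as np


def pls1_nipals(X, y, n_components):
    """Minimal PLS1 (NIPALS) returning weights W, loadings P, scores T and q."""
    E = X - X.mean(axis=0)
    f = y - y.mean()
    W, P, T, q = [], [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)           # unit-length PLS weights
        t = E @ w
        p = E.T @ t / (t @ t)
        c = (f @ t) / (t @ t)
        E, f = E - np.outer(t, p), f - c * t
        W.append(w); P.append(p); T.append(t); q.append(c)
    return (np.column_stack(W), np.column_stack(P),
            np.column_stack(T), np.array(q))


def vip(W, T, q):
    """VIP_k = sqrt(K * sum_a(SSY_a * w_ak^2) / sum_a(SSY_a))."""
    ssy = q ** 2 * np.sum(T ** 2, axis=0)   # y-variation explained per component
    K = W.shape[0]
    return np.sqrt(K * (W ** 2 @ ssy) / ssy.sum())


# Hypothetical data: 40 samples, 25 buckets, classes coded 1 and 2.
rng = np.random.default_rng(2)
n_vars = 25
y = np.repeat([1.0, 2.0], 20)
X = rng.normal(size=(40, n_vars))
X[y == 2.0, 3] += 4.0            # bucket 3 is higher in class 2
X[y == 2.0, 7] -= 4.0            # bucket 7 is lower in class 2

W, P, T, q = pls1_nipals(X, y, n_components=2)
vip_values = vip(W, T, q)
b = W @ np.linalg.solve(P.T @ W, q)       # regression coefficients

influential = np.flatnonzero(vip_values > 1.0)   # VIP > 1
ranking = np.argsort(-vip_values)                # decreasing importance
```

With unit-length weights the squared VIP values sum to the number of variables, which
is why a threshold of 1 separates above-average from below-average importance. The
sign of each regression coefficient indicates the direction of the difference between
the classes for that bucket.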






Timed Sampling 



The analysis methods described herein can be applied to a single sample, or
alternatively, to a timed series of samples. These samples may be taken relatively close
together in time (e.g., daily) or less frequently (e.g., monthly or yearly).

The timed series of samples may be used for one or more purposes, e.g., to make
sequential diagnoses, applying the same classification method as if each sample were a
single sample. This will allow greater confidence in the diagnosis compared to obtaining
a single sample for the patient, or alternatively to monitor temporal changes in the
subject (e.g., changes in the underlying condition being diagnosed, treated, etc.).

Alternatively, the timed series of samples can be collectively treated as a single dataset,
increasing the information density of the input dataset and hence increasing the power of
the analysis method to identify weaker patterns.

As yet another alternative, the timed series of samples can be collectively processed to
yield a single dataset in which the temporal changes (e.g., in each bin) are included as an
extra list of variables (e.g., as in composite data sets). Temporal changes in the amount
of (e.g., endogenous) diagnostic species may greatly improve the ability of the analysis
method to accurately classify patterns (especially when patterns are weak).
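
One way of including temporal changes as extra variables is sketched below. The layout
chosen here (the first spectrum followed by the bin-wise change between consecutive
time points) is an assumption made for illustration; the function name and the numbers
are hypothetical.

```python
import numpy as np


def composite_with_changes(series):
    """Build one data vector from a timed series of binned spectra.

    series has shape (n_timepoints, n_bins). The result holds the first
    spectrum followed by the bin-wise change between each pair of
    consecutive time points.
    """
    series = np.asarray(series, dtype=float)
    changes = np.diff(series, axis=0)        # shape (n_timepoints - 1, n_bins)
    return np.concatenate([series[0], changes.ravel()])


# Hypothetical subject: three time points, four bins.
series = np.array([[1.0, 2.0, 3.0, 4.0],
                   [1.5, 2.0, 2.0, 4.0],
                   [2.0, 2.0, 1.0, 4.5]])
row = composite_with_changes(series)
```

Stacking one such row per subject gives a data matrix that can be analysed with the
same pattern recognition methods as single-sample data.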

Batch Modelling



The methods described herein, including their applications (e.g., diagnosis, prognosis),
may be further improved by employing batch modelling.

Statistical batch processing can be divided into two levels of multivariate modelling. The
lower or the observation level is usually based on Partial Least Squares (PLS)
regression against time (or any other index describing process maturity), whereas the
upper or batch level consists of a PCA based on the scores from the lower level PLS
model. PLS can also be used in the upper level to correlate the matrix based on the
lower level scores with the end properties of the separate batches. This is common in
industrial applications where properties of the end product are used as a description of
quality.

At the lower level of the batch modelling the evolution of the studied process with time
(maturity) can be monitored and interpreted in terms of PLS scores and loadings. Since
the PLS performs a regression against sampling time (maturity), the calculated
components will be focused on the evolution with time. The fact that the calculated PLS
components are orthogonal to each other means that it is possible to detect independent
time (maturity) profiles and also to interpret which measured variables are causing these
profiles. Confidence limits are used for detection of deviating behaviour of any spectra at
any time point for some optional significance level, usually 95% and/or 99%.

The residuals expressed as distance to model (DModX) are, at the lower level, another
important tool for detecting outlying batches or deviating behaviour for a specific batch at
a specific time point. The upper level or batch level provides the possibility to just look at
the difference between the separate batches. This is done by using the lower level
scores including all time points for each batch as new variables describing each single
batch and then performing a PCA on this new data matrix. The features of scores,
loadings and DModX are used in the same way as for ordinary PCA analysis, with the
exception that the upper level loadings can be traced back down to the lower level for a
more detailed explanation in the original loadings.

Predictions for "new" batches can be done on both levels of the batch model. On the
lower level, monitoring of evolution with time using scores and DModX is a powerful tool
for detecting deviating behaviour from normality for a batch at any time point. On the
upper level, prediction of single batch behaviour can be done in terms of scores and
DModX.
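
The two-level structure described above can be sketched as follows. This is a
simplified illustration, not the batch modelling software referred to in the cited
references: the lower level uses a single PLS component regressed against time, the
upper level is a PCA (by SVD) of the lower-level score trajectories, and the data are
simulated so that one batch (index 7) evolves at half the rate of the others. All
names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_batches, n_times, n_vars = 8, 10, 20
time = np.arange(n_times, dtype=float)

# Hypothetical data: five variables rise with time; batch 7 evolves at half speed.
X = rng.normal(scale=0.05, size=(n_batches, n_times, n_vars))
rate = np.full(n_batches, 0.5)
rate[7] = 0.25
X[:, :, :5] += (rate[:, None] * time[None, :])[:, :, None]

# Lower (observation) level: one-component PLS of all observations against time.
X_obs = X.reshape(n_batches * n_times, n_vars)    # rows ordered batch by batch
y_obs = np.tile(time, n_batches)                  # sampling time of each row
E = X_obs - X_obs.mean(axis=0)
f = y_obs - y_obs.mean()
w = E.T @ f
w /= np.linalg.norm(w)                            # lower-level PLS weights
t_lower = E @ w                                   # lower-level scores

# Upper (batch) level: each batch is described by its score trajectory,
# i.e. the lower-level scores at all time points become its variables.
S = t_lower.reshape(n_batches, n_times)
Sc = S - S.mean(axis=0)
U, s, Vt = np.linalg.svd(Sc, full_matrices=False)
upper_scores = U[:, :2] * s[:2]                   # one row per batch
upper_loadings = Vt[:2].T                         # one row per time point

deviating_batch = int(np.argmax(np.abs(upper_scores[:, 0])))
```

In this sketch the slow batch stands out on the first upper-level component, and the
upper-level loadings indicate at which time points its trajectory departs from the
others; the lower-level weights then point to the variables involved.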

The definition of a batch process, and also a requirement for batch modelling, is a
process where all batches have equal duration and are synchronised according to
sample collection. An example is samples taken from a cohort of animals at identical
fixed time points to monitor the effects of an administered xenobiotic substance.

The advantage of using batch modelling for such studies is the possibility of detecting
known, or discovering new, metabolic processes which evolve with time in the lower
level scores, and also the identification of the actual metabolites involved in the different
processes from the contributing lower level loadings. The lower level analysis also
makes it possible to differentiate between single observations (e.g., individual animals at
specific time points).



Applications for the lower level modelling include, for example, distinguishing between
undosed controls and dosed animals in terms of metabolic effects of dosing at certain
time points; and creating models for normality and using the models as a classification
tool for new samples, e.g., as normal or abnormal. This may be achieved using a PLS
prediction of the new sample's class using the model describing normality. Decisions
can then be made on the basis of the combination of the predicted scores and residuals
(DModX).

An automated expert system can be used for early fault detection in the lower level batch
modelling, and this can be used to further enhance the analysis procedure and improve
efficiency.



The upper level provides the possibility of making predictions of new animals using the
existing model. Abnormal animals can then be detected by judging predicted scores and
residuals (DModX) together. Since the upper level model is based on the lower level
scores, the interpretation of an animal predicted to be abnormal can be traced back to
the original lower level scores and loadings as well as the original raw variables making
up the NMR spectra. Combining the upper and lower level for prediction of the status of
a new animal, the classification can be based on four parameters: upper level scores
and residuals (DModX) and lower level scores and residuals (DModX). This
demonstrates that batch modelling is an efficient tool for determining if an animal is
normal or abnormal, and if the latter, why and when it is deviating from normality.

See, for example, Wold et al., 1998b and Eriksson et al., 1999.

Integrated Metabonomics

As discussed above, many of the methods of the present invention may also be applied
to composite data or composite data sets. The term "composite data set," as used
herein, pertains to a spectrum (or data vector) which comprises spectral data (e.g., NMR
spectral data, e.g., an NMR spectrum) as well as at least one other datum or data vector.
Examples of other data vectors include, e.g., one or more other NMR spectral data, e.g.,
NMR spectra, e.g., obtained for the same sample using a different NMR technique; other
types of spectra, e.g., mass spectra, numerical representations of images, etc.; obtained
for another sample, of the same sample type (e.g., blood, urine, tissue, tissue
extract), but obtained from the subject at a different timepoint; obtained for another
sample of different sample type (e.g., blood, urine, tissue, tissue extract) for the same
subject; and the like.

Examples of other data include, e.g., one or more clinical parameters. Clinical
parameters which are suitable for use in composite methods include, but are not limited
to, the following:

(a) established clinical parameters routinely measured in hospital clinical labs: age; sex;
body mass index; height; weight; family history; medication history; cigarette smoking;
alcohol intake; blood pressure; full blood cell count (FBCs); red blood cells; white blood
cells; monocytes; lymphocytes; neutrophils; eosinophils; basophils; platelets;
haematocrit; haemoglobin; mean corpuscular volume and related haemodilution
indicators; fibrinogen; functional clotting parameters (thromboplastin and partial
thromboplastin); electrolytes (sodium, potassium, calcium, phosphate); urea; creatinine;
total protein; albumin; globulin; bilirubin; protein markers of liver function (alanine
aminotransferase, alkaline phosphatase, gamma glutamyl transferase); glucose; HbA1c
(a measure of glucose-haemoglobin conjugates used to monitor diabetes); lipoprotein
profile; total cholesterol; LDL; HDL; triglycerides; blood group.

(b) established research parameters routinely measured in research laboratories but not
usually measured in hospitals: hormonal status; testosterone; estrogen; progesterone;
follicle stimulating hormone; inhibin; transforming growth factor-beta1; transforming
growth factor-beta2; chemokines; MCP-1; eotaxin; plasminogen activator inhibitor-1;
cystatin C.

(c) early-stage research parameters measured in one or a small number of specialist
labs: antibodies to sRII; antibodies to blood group A antigen; antibodies to blood group B
antigen; immunoglobulin (IgD) against alpha-gal; immunoglobulin (IgD) against penta-gal.

Diagnostic Spectral Windows

As discussed above, many of the methods of the present invention involve relating NMR
spectral intensity at one or more predetermined diagnostic spectral windows with a
predetermined condition.

Examples of methods for identifying one or more suitable diagnostic spectral windows for
a given condition, using, for example, pattern recognition methods, are described herein.

The term "diagnostic spectral window," as used herein, pertains to a narrow range of
chemical shift (Δδ) values encompassing an index value, δr (that is, δr falls within the
range Δδ). Each index value, and its associated spectral window, define a range of
chemical shift (Δδ) in which the NMR spectral intensity is indicative of the presence of
one or more chemical species.

For 2D NMR methods, the diagnostic spectral window refers to a chemical shift patch
(Δδ1, Δδ2) which encompasses an index value, [δr1, δr2]. For 3D NMR methods, the
diagnostic spectral window refers to a chemical shift volume (Δδ1, Δδ2, Δδ3) which
encompasses an index value, [δr1, δr2, δr3].

In one embodiment, the spectral window is centred with respect to its index value (e.g.,
δr = δ 1.30; |Δδ| = δ 0.04, and Δδ = δ 1.28-1.32).
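
The centred-window arithmetic in the example above can be written out as a short
sketch. The function names, the bucket centres and the intensities are hypothetical;
summing the buckets whose centres fall inside the window is one simple way of obtaining
an intensity for the window and is an assumption of this example.

```python
import numpy as np


def spectral_window(index_value, breadth):
    """Return (low, high) for a window centred on the index value."""
    half = breadth / 2.0
    return index_value - half, index_value + half


def window_intensity(centres, intensities, low, high):
    """Sum the intensities of buckets whose centres fall inside the window."""
    centres = np.asarray(centres, dtype=float)
    # A small tolerance guards against floating-point error at the edges.
    inside = (centres >= low - 1e-9) & (centres <= high + 1e-9)
    return float(np.asarray(intensities, dtype=float)[inside].sum())


low, high = spectral_window(1.30, 0.04)          # window from 1.28 to 1.32

# Hypothetical bucketed spectrum with buckets of width 0.04.
centres = np.array([1.22, 1.26, 1.30, 1.34, 1.38])
intensities = np.array([0.8, 1.1, 5.2, 1.0, 0.7])
signal = window_intensity(centres, intensities, low, high)
```

With a breadth equal to the bucket width, the window picks out exactly one bucket, the
one centred on the index value.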

The breadth of the range, |Δδ|, is determined largely by the spectroscopic parameters,
such as field strength/frequency, temperature, sample viscosity, etc. The breadth of the
range is often chosen to encompass a typical spin-coupled multiplet pattern. For peaks
whose position varies with sample pH, the breadth of the range may be widened to
encompass the expected range of positions.

Typically, the breadth of the range, |Δδ|, is from about δ 0.001 to about δ 0.2.
In one embodiment, the breadth is from about δ 0.005 to about δ 0.1.
In one embodiment, the breadth is from about δ 0.005 to about δ 0.08.
In one embodiment, the breadth is from about δ 0.01 to about δ 0.08.
In one embodiment, the breadth is from about δ 0.02 to about δ 0.08.
In one embodiment, the breadth is from about δ 0.005 to about δ 0.06.
In one embodiment, the breadth is from about δ 0.01 to about δ 0.06.
In one embodiment, the breadth is from about δ 0.02 to about δ 0.06.
In one embodiment, the breadth is about δ 0.04.

In one embodiment, the breadth is equal to the "bucket" or "bin" width. In one
embodiment, the breadth is equal to an integer multiple of the "bucket" or "bin" width.

Although the diagnostic spectral windows are determined in relation to the condition
under study, the precise index values for such windows may vary in accordance with the
experimental parameters employed, for example, the digital resolution in the original
spectra, the width of the buckets used, the temperature of the spectral data acquisition,
etc. The exact composition of the sample (e.g., biofluid, tissue, etc.) can affect peak
positions by compartmentation, metal complexation, protein-small molecule binding, etc.
The observation frequency will have an effect because of different degrees of peak
overlap and of first/second order nature of spectra.

In one embodiment, said one or more predetermined diagnostic spectral windows is: a
single predetermined diagnostic spectral window.

In one embodiment, said one or more predetermined diagnostic spectral windows is: a
plurality of predetermined diagnostic spectral windows. In practice, this may be
preferred.

Although the theoretical limit on the number of predetermined diagnostic spectral
windows is a function of the data density (e.g., the number of variables, e.g., buckets),
typically the number of predetermined diagnostic spectral windows is from 1 to about 30.
It is possible for the actual number to be in any sub-range within these general limits.
Examples of lower limits include 1, 2, 3, 4, 5, 6, 8, 10, and 15. Examples of upper limits
include 3, 4, 5, 6, 8, 10, 15, 20, 25, and 30.



In one embodiment, the number is from 1 to about 20.
In one embodiment, the number is from 1 to about 15.
In one embodiment, the number is from 1 to about 10.
In one embodiment, the number is from 1 to about 8.
In one embodiment, the number is from 1 to about 6.
In one embodiment, the number is from 1 to about 5.
In one embodiment, the number is from 1 to about 4.
In one embodiment, the number is from 1 to about 3.
In one embodiment, the number is 1 or 2.

In one embodiment, said one or more predetermined diagnostic spectral windows is: a 
plurality of diagnostic spectral windows; and, said NMR spectral intensity at one or more 
predetermined diagnostic spectral windows is: a combination of a plurality of NMR 
spectral intensities, each of which is NMR spectral intensity for one of said plurality of 
predetermined diagnostic spectral windows. 

In one embodiment, said combination is a linear combination. 
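
As a minimal sketch of such a linear combination, the Python fragment below computes a weighted sum of integrated intensities, one per diagnostic spectral window. The window labels, intensity values, and weights are hypothetical placeholders chosen for illustration; they are not values taken from this specification.

```python
def combined_intensity(intensities, weights):
    """Return the weighted sum (linear combination) of window intensities.

    Both arguments map a window label to a number. Every window that has
    a weight must also have a measured intensity.
    """
    missing = set(weights) - set(intensities)
    if missing:
        raise KeyError(f"no intensity for windows: {sorted(missing)}")
    return sum(weights[w] * intensities[w] for w in weights)


# Hypothetical integrated intensities for three windows (arbitrary units).
intensities = {"window_a": 4.0, "window_b": 2.5, "window_c": 1.0}
# Hypothetical coefficients, e.g. as might be supplied by a fitted model.
weights = {"window_a": 0.5, "window_b": -1.0, "window_c": 2.0}

score = combined_intensity(intensities, weights)
print(score)
```

A negative weight lets an intensity that falls in the condition of interest contribute in the same direction as one that rises.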

In one embodiment, at least one of said one or more predetermined diagnostic spectral 
windows encompasses a chemical shift value for an NMR resonance of a diagnostic 
species (e.g., a ¹H NMR resonance of a diagnostic species). 

In one embodiment, each of a plurality of said one or more predetermined diagnostic 
spectral windows encompasses a chemical shift value for an NMR resonance of a 
diagnostic species (e.g., a ¹H NMR resonance of a diagnostic species). 

In one embodiment, each of said one or more predetermined diagnostic spectral 
windows encompasses a chemical shift value for an NMR resonance of a diagnostic 
species (e.g., a ¹H NMR resonance of a diagnostic species). 



Diagnostic Spectral Windows - Atherosclerosis/CHD 

It is believed that the index values, and the associated diagnostic spectral windows, 
primarily reflect the species described in Table 4-CHD. 

In one embodiment, said predetermined diagnostic spectral windows are defined by one 
or more index values, δr, corresponding to the bucket regions listed in Table 4-CHD. 

In one embodiment, said predetermined diagnostic spectral windows are defined by one 
or more index values, δr, corresponding to the bucket regions listed in Table 4-CHD, and 
a breadth of range value, |Δδ|, of about 0.04. 
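
The relationship between an index value and its window can be sketched as follows. This is an illustration only: it assumes that the index value marks the centre of the window and that chemical shifts are expressed in ppm, neither of which is stated in the passage above, and the index value of 1.26 is a made-up example rather than an entry from Table 4-CHD.

```python
def window_bounds(index_ppm, breadth_ppm=0.04):
    """Return the (low, high) chemical-shift limits of a window, in ppm.

    Assumption: the index value is the centre of the window, so half of
    the breadth lies on either side of it.
    """
    half = breadth_ppm / 2.0
    return index_ppm - half, index_ppm + half


def in_window(shift_ppm, index_ppm, breadth_ppm=0.04):
    """Return True if a chemical shift lies inside the window (limits included)."""
    low, high = window_bounds(index_ppm, breadth_ppm)
    return low <= shift_ppm <= high


# Hypothetical index value, for illustration only.
low, high = window_bounds(1.26)
print(in_window(1.25, 1.26), in_window(1.30, 1.26))
```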

In one embodiment, said predetermined diagnostic spectral windows are defined by one 
or more index values, δr, corresponding to the bucket regions listed in Table 4-CHD, and 
which are determined using the conditions set forth in the section entitled 
"NMR Experimental Parameters." 

Diagnostic Species and Biomarkers 

The index values, and the associated diagnostic spectral windows, define ranges of 
chemical shift in which NMR spectral intensity is indicative of the presence of one or 
more chemical species, one or more of which are diagnostic species (e.g., biomarkers), 
for example, for a condition (e.g., indication) under study. 

In one embodiment, said one or more diagnostic species are endogenous diagnostic 
species. 

In one embodiment, said one or more diagnostic species are associated with NMR 
spectral intensity at predetermined diagnostic spectral windows. 

In one embodiment, said one or more diagnostic species are a plurality of diagnostic 
species (i.e., a combination of diagnostic species). 

In one embodiment, said one or more diagnostic species is a single diagnostic species. 

The term "endogenous species," as used herein, pertains to chemical species which 
originated from the subject under study, for example, which were present in the sample 
of the subject. 

Once an index value, and its associated diagnostic spectral window, is identified (e.g., by 
the application of modelling methods as described herein), it is often possible to identify 
one or more putative biomarkers which give rise to NMR spectral intensity in that 
particular window. 

The (e.g., integrated) NMR spectral intensity in a particular spectral window 
(e.g., bucket) is the sum of the spectral intensity for all of the NMR peaks in that window. 
Usually, for small molecules which give sharp NMR peaks, it is possible to examine the 
raw NMR data and determine which of the peaks is responsible for that particular 
spectral window being selected as significant by the applied pattern recognition method. 
The relevant peak(s) are then assigned. 
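
The summation described above can be sketched in a few lines of Python. The fragment approximates the integrated intensity of a window by adding up the digitised intensities of all data points whose chemical shift falls within it. The half-open interval, the variable names, and the sample values are assumptions made for this illustration.

```python
def window_intensity(shifts_ppm, intensities, low_ppm, high_ppm):
    """Sum the intensities of all data points with low_ppm <= shift < high_ppm."""
    if len(shifts_ppm) != len(intensities):
        raise ValueError("shifts and intensities must have equal length")
    return sum(
        y for x, y in zip(shifts_ppm, intensities) if low_ppm <= x < high_ppm
    )


# Hypothetical digitised spectrum: chemical shifts (ppm) and intensities.
shifts = [1.20, 1.22, 1.24, 1.26, 1.28, 1.30]
signal = [0.1, 0.3, 2.0, 5.0, 1.5, 0.2]

total = window_intensity(shifts, signal, 1.24, 1.28)
print(total)
```

Using a half-open interval means that a point lying exactly on a shared boundary is counted in only one of two adjacent buckets.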

Such assignments may be made, for example, by reference to published data; by 
comparison with spectra of authentic materials; by standard addition of an authentic 
reference standard to the sample; or by separating the individual component, e.g., by using 
HPLC-NMR, and identifying it using NMR and mass spectrometry. Additional 
confirmation of assignments is usually sought from the application of other NMR 
methods, including, for example, 2-dimensional (2D) NMR methods. 

In another approach, concentrations of candidate chemical species are measured by 
another specific method (e.g., ELISA, chromatography, RIA, etc.) and compared with the 
spectral intensity observed in the relevant diagnostic spectral window, and any 
correlation noted. This will reveal how much of the variance in the diagnostic spectral 
window is contributed by the candidate chemical species. This may also reveal that 
suspected diagnostic species are, in fact, not highly correlated with the condition under 
examination. 
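
One simple way to quantify such a correlation is the Pearson coefficient r, whose square gives the fraction of variance accounted for by a straight-line relationship. The sketch below is offered under that assumption; the specification does not prescribe a particular statistic, and the paired values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements for five samples.
concentration = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. from an independent assay
window_signal = [2.1, 3.9, 6.2, 7.8, 10.1]  # intensity in the spectral window

r = pearson_r(concentration, window_signal)
variance_explained = r ** 2
print(round(r, 3), round(variance_explained, 3))
```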

Methods of Identifying Diagnostic Species 

Thus, the methods described herein also facilitate the identification of species (often 
referred to as biomarkers or diagnostic species) which are indicative (e.g., diagnostic) of 
a particular condition. For example, particular metabolites (e.g., in blood, urine, etc.) 
may be diagnostic of a particular condition. 

One aspect of the present invention pertains to a method of identifying such diagnostic 
species (e.g., biomarkers), as described herein. 

One aspect of the present invention pertains to a method of identifying a diagnostic 
species, or a combination of a plurality of diagnostic species, for a predetermined 
condition, said method comprising the steps of: 

(a) applying a multivariate statistical analysis method to experimental data; 

wherein said experimental data comprises at least one data comprising 
experimental parameters measured for each of a plurality of experimental samples; 

wherein said experimental samples define a class group consisting of a plurality 
of classes; 

wherein at least one of said plurality of classes is a class associated with said 
predetermined condition, e.g., a class associated with the presence of said 
predetermined condition; 

wherein at least one of said plurality of classes is a class not associated with said 
predetermined condition, e.g., a class associated with the absence of said 
predetermined condition; 

wherein each of said experimental samples is of known class selected from said 
class group; 

and: 

(b) identifying one or more critical experimental parameters; 

wherein each of said critical experimental parameters is statistically significantly 
different for classes of said class group, e.g., is statistically significant for discriminating 
between classes of said class group; and, 

(c) matching each of one or more of said one or more critical experimental 
parameters with said diagnostic species; 

or: 

(b) identifying a combination of a plurality of critical experimental parameters; 

wherein said combination of a plurality of critical experimental parameters is 
statistically significantly different for classes of said class group, e.g., is statistically 
significant for discriminating between classes of said class group; and, 

(c) matching each of one or more of said plurality of critical experimental 
parameters with said combination of a plurality of diagnostic species. 
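
Step (b) can be illustrated with a deliberately simple stand-in. The method above calls for multivariate analysis; the sketch below instead screens each parameter on its own with Welch's t statistic, purely to show what "different between classes" means numerically. The cut-off of 4.0, the parameter names, and the data are hypothetical, and no p-values are computed.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic for two independent samples.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def critical_parameters(data_by_parameter, threshold):
    """Return the parameters whose |t| between the two classes exceeds threshold.

    data_by_parameter maps a parameter name to a pair of value lists:
    (values for the class with the condition, values for the class without).
    """
    return [
        name
        for name, (with_condition, without_condition) in data_by_parameter.items()
        if abs(welch_t(with_condition, without_condition)) > threshold
    ]


# Hypothetical bucket intensities for samples of known class.
data = {
    "bucket_1": ([5.1, 4.9, 5.3, 5.0], [3.0, 3.2, 2.9, 3.1]),
    "bucket_2": ([1.0, 1.2, 0.9, 1.1], [1.1, 0.9, 1.0, 1.2]),
}

selected = critical_parameters(data, threshold=4.0)
print(selected)
```

Each parameter selected in this way would then be matched to a chemical species as in step (c).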

In one embodiment, one or more of said critical experimental parameters is a spectral 
parameter (i.e., a critical experimental spectral parameter); and said identifying and 
matching steps are: 

(b) identifying one or more critical experimental spectral parameters; and, 

(c) matching each of one or more of said one or more critical experimental 
spectral parameters with a spectral feature, e.g., a spectral peak; and 

matching one or more of said spectral peaks with said diagnostic species; 

or: 

(b) identifying a combination of a plurality of critical experimental spectral 
parameters; and, 

(c) matching each of a plurality of said plurality of critical experimental spectral 
parameters with a spectral feature, e.g., a spectral peak; and 

matching one or more of said spectral peaks with said combination of a plurality 
of diagnostic species. 

In one embodiment, said multivariate statistical analysis method is a multivariate 
statistical analysis method which employs a pattern recognition method. 

In one embodiment, said multivariate statistical analysis method is, or employs, PCA. 

In one embodiment, said multivariate statistical analysis method is, or employs, PLS. 

In one embodiment, said multivariate statistical analysis method is, or employs, PLS-DA. 

In one embodiment, said multivariate statistical analysis method includes a step of data 
filtering. 


In one embodiment, said multivariate statistical analysis method includes a step of 
orthogonal data filtering. 

In one embodiment, said multivariate statistical analysis method includes a step of OSC. 
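
Of the techniques named above, PCA is perhaps the easiest to illustrate. The sketch below estimates only the first principal component, by power iteration on the sample covariance matrix. It is a standard-library illustration with hypothetical data, not the software or algorithm used to obtain the results in this specification; PLS, PLS-DA, and OSC are not shown.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return a unit vector along the first principal component of the data.

    rows is a list of samples, each a list of p variables (e.g. buckets).
    The sign of the returned vector is arbitrary.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Arbitrary non-uniform start vector; power iteration can fail if the
    # start happens to be orthogonal to the leading component.
    vec = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            raise ValueError("iteration collapsed to zero (no usable variance)")
        vec = [v / norm for v in nxt]
    return vec


# Hypothetical two-variable data lying on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(rows)
print(pc1)
```

Projecting each mean-centred sample onto this vector gives its score on the first component.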


In one embodiment, said experimental parameters comprise spectral data. 

In one embodiment, said experimental parameters comprise both spectral data and 
non-spectral data (and is referred to as "composite experimental data"). 

In one embodiment, said experimental parameters comprise NMR spectral data. 

In one embodiment, said experimental parameters comprise both NMR spectral data and 
non-NMR spectral data. 

In one embodiment, said NMR spectral data comprises ¹H NMR spectral data and/or ¹³C 
NMR spectral data. 

In one embodiment, said NMR spectral data comprises ¹H NMR spectral data. 

In one embodiment, said non-spectral data is non-spectral clinical data. 

In one embodiment, said non-NMR spectral data is non-spectral clinical data. 

In one embodiment, said critical experimental parameters are spectral parameters. 


In one embodiment, said class group comprises classes associated with said 
predetermined condition (e.g., presence, absence, degree, etc.). 

In one embodiment, said class group comprises exactly two classes. 


In one embodiment, said class group comprises exactly two classes: presence of said 
predetermined condition; and absence of said predetermined condition. 

In one embodiment, said class associated with said predetermined condition is a class 
associated with the presence of said predetermined condition. 

In one embodiment, said class not associated with said predetermined condition is a 
class associated with the absence of said predetermined condition. 

In one embodiment, said method further comprises the additional step of: 
(d) confirming the identity of said diagnostic species. 

One aspect of the present invention pertains to novel diagnostic species (e.g., biomarkers) 
which are identified by such a method. 



One aspect of the present invention pertains to one or more diagnostic species 
(e.g., biomarkers) which are identified by such a method for use in a method of 
classification (e.g., diagnosis). 

One aspect of the present invention pertains to a method of classification 
(e.g., diagnosis) which employs or relies upon one or more diagnostic species 
(e.g., biomarkers) which are identified by such a method. 

One aspect of the present invention pertains to use of one or more diagnostic species 
(e.g., biomarkers) which are identified by such a method in a method of classification 
(e.g., diagnosis). 

One aspect of the present invention pertains to an assay for use in a method of 
classification (e.g., diagnosis), which assay relies upon one or more diagnostic species 
(e.g., biomarkers) which are identified by such a method. 

One aspect of the present invention pertains to use of an assay in a method of 
classification (e.g., diagnosis), which assay relies upon one or more diagnostic species 
(e.g., biomarkers) which are identified by such a method. 

Diagnostic Species - Atherosclerosis/CHD 

In one embodiment, at least one of said one or more predetermined diagnostic species is 
a species described in Table 4-CHD. 

In one embodiment, each of a plurality of said one or more predetermined diagnostic 
species is a species described in Table 4-CHD. 

In one embodiment, each of said one or more predetermined diagnostic species is a 
species described in Table 4-CHD. 

Amount or Relative Amount 

As discussed above, many of the methods of the present invention involve classification 
on the basis of an amount, or a relative amount, of one or more diagnostic species. 

In one embodiment, said classification is performed on the basis of an amount, or a 
relative amount, of a single diagnostic species. 

In one embodiment, said classification is performed on the basis of an amount, or a 
relative amount, of a plurality of diagnostic species. 

In one embodiment, said classification is performed on the basis of an amount, or a 
relative amount, of each of a plurality of diagnostic species. 

In one embodiment, said classification is performed on the basis of a total amount, or a 
relative total amount, of a plurality of diagnostic species. 

In one embodiment (wherein said one or more diagnostic species is: a plurality of 
diagnostic species), said amount of, or relative amount of, one or more diagnostic 
species is: a combination of a plurality of amounts, or relative amounts, each of which is 
the amount of, or relative amount of, one of said plurality of diagnostic species. 

In one embodiment, said combination is a linear combination. 

The term "amount," as used in this context, pertains to the amount regardless of the 
terms of expression. 

The term "amount," as used herein in the context of "amount of, or relative amount of, 
(e.g., diagnostic) species," pertains to the amount regardless of the terms of expression. 

Absolute amounts may be expressed, for example, in terms of mass (e.g., pg), moles 
(e.g., pmol), volume (e.g., pL), concentration (molarity, pg/mL, pg/g, wt%, vol%, etc.), etc. 

Relative amounts may be expressed, for example, as ratios of absolute amounts (e.g., 
as a fraction, as a multiple, as a %) with respect to another chemical species. For 
example, the amount may be expressed as a relative amount, relative to an internal 
standard, for example, another chemical species which is endogenous or added. 
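
A relative amount of this kind is a simple ratio. The sketch below expresses an amount relative to a reference (e.g., an internal standard) both as a ratio and as a percentage; the function name and the numbers are illustrative assumptions, and both amounts must be in the same units.

```python
def relative_amount(amount, reference_amount):
    """Express an amount relative to a reference amount in the same units."""
    if reference_amount == 0:
        raise ValueError("reference amount must be non-zero")
    ratio = amount / reference_amount
    return {"ratio": ratio, "percent": 100.0 * ratio}


# Hypothetical amounts of a species and of an internal standard.
result = relative_amount(amount=3.0, reference_amount=12.0)
print(result)
```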

The amount may be indicated indirectly, in terms of another quantity (possibly a 
precursor quantity) which is indicative of the amount. For example, the other quantity 
may be a spectrometric or spectroscopic quantity (e.g., signal, intensity, absorbance, 
transmittance, extinction coefficient, conductivity, etc.; optionally processed, e.g., 
integrated) which is itself indicative of the amount. 

The amount may be indicated, directly or indirectly, in regard to a different chemical 
species (e.g., a metabolic precursor, a metabolic product, etc.), which is indicative of the 
amount. 

Diagnostic Shift 

As discussed above, many of the methods of the present invention involve classification 
on the basis of a modulation, e.g., of NMR spectral intensity at one or more 
predetermined diagnostic spectral windows; of the amount, or a relative amount, of 
diagnostic species; etc. In this context, "modulation" pertains to a change, and may be, 
for example, an increase or a decrease. In one embodiment, said "a modulation of" is 
"an increase or decrease in." 

In one embodiment, the modulation (e.g., increase, decrease) is at least 10%, as 
compared to a suitable control. In one embodiment, the modulation (e.g., increase, 
decrease) is at least 20%, as compared to a suitable control. In one embodiment, the 
modulation is a decrease of at least 50% (i.e., a factor of 0.5). In one embodiment, the 
modulation is an increase of at least 100% (i.e., a factor of 2). 
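
The percentages above can be read as fractional changes relative to the control value. The sketch below computes that change and tests it against a minimum magnitude; the function names and example values are hypothetical, and the default threshold of 0.10 simply mirrors the "at least 10%" embodiment.

```python
def modulation(test_value, control_value):
    """Return the fractional change of a test value relative to a control.

    A result of -0.5 is a 50% decrease (a factor of 0.5); a result of
    1.0 is a 100% increase (a factor of 2).
    """
    if control_value == 0:
        raise ValueError("control value must be non-zero")
    return (test_value - control_value) / control_value


def exceeds_threshold(test_value, control_value, min_change=0.10):
    """Return True if the magnitude of the modulation is at least min_change."""
    return abs(modulation(test_value, control_value)) >= min_change


print(modulation(5.0, 10.0))          # 50% decrease
print(modulation(20.0, 10.0))         # 100% increase
print(exceeds_threshold(10.5, 10.0))  # 5% change, below a 10% threshold
```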

Each of a plurality of predetermined diagnostic spectral windows, and each of a plurality 
of diagnostic species, may have independent modulations, which may be the same or 
different. For example, if there are two predetermined diagnostic spectral windows, 
NMR spectral intensity may increase in one window and decrease in the other window. 
In this way, combinations of modulations of NMR spectral intensity in different diagnostic 
spectral windows may be diagnostic. Similarly, if there are two diagnostic species, the 
amount of one may increase, and the amount of the other may decrease. Again, 
combinations of modulations of amounts, or relative amounts, of different diagnostic 
species may be diagnostic. See, for example, the data in the Examples below, which 
illustrate cases where different species have different modulations. 

The term "diagnostic shift," as used herein, pertains to a modulation (e.g., increase, 
decrease), as compared to a suitable control. 

A diagnostic shift may be in regard to, for example, NMR spectral intensity at one or 
more predetermined diagnostic spectral windows; or the amount of, or relative amount 
of, diagnostic species. 

Control Samples, Control Subjects, Control Data 

Suitable controls are usually selected on the basis of the organism (e.g., subject, patient) 
under study (test subject, study subject, etc.), and the nature of the study (e.g., type of 
sample, type of spectra, etc.). Usually, controls are selected to represent the state of 
"normality." As described herein, deviations from normality (e.g., higher than normal, 
lower than normal) in test data, test samples, test subjects, etc. are used in classification, 
diagnosis, etc. 
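
One common way to express a deviation from normality is a z-score: the distance of a test value from the control mean, in units of the control standard deviation. The specification does not prescribe this measure; the sketch below is an assumption made for illustration, with hypothetical control values.

```python
from statistics import mean, stdev


def deviation_from_normal(test_value, control_values):
    """Return the z-score of a test value relative to a set of control values.

    Positive results are higher than normal, negative results lower.
    """
    spread = stdev(control_values)
    if spread == 0:
        raise ValueError("control values have no spread")
    return (test_value - mean(control_values)) / spread


# Hypothetical control measurements and one test measurement.
controls = [9.0, 10.0, 11.0, 10.0, 10.0]
z = deviation_from_normal(13.0, controls)
print(round(z, 2))
```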

For example, in most cases, control subjects are the same species as the test subject 
and are chosen to be representative of the equivalent normal (e.g., healthy) organism. A 
control population is a population of control subjects. If appropriate, control subjects may 
have characteristics in common (e.g., sex, ethnicity, age group, etc.) with the test 
subject. If appropriate, control subjects may have characteristics (e.g., age group, etc.) 
which differ from those of the test subject. For example, it may be desirable to choose 
healthy 20-year olds of the same sex and ethnicity as the study subject as control 
subjects. 

In most cases, control samples are taken from control subjects. Usually, control samples 
are of the same sample type (e.g., serum), and are collected and handled (e.g., treated, 
processed, stored) under the same or similar conditions, as the sample under study 
(e.g., test sample, study sample). 

In most cases, control data (e.g., control values) are obtained from control samples 
which are taken from control subjects. Usually, control data (e.g., control data sets, 
control spectral data, control spectra, etc.) are of the same type (e.g., 1-D NMR, etc.), 
and are collected and handled (e.g., recorded, processed) under the same or similar 
conditions (e.g., parameters), as the test data. 

Implementation 

The methods of the present invention, or parts thereof, may be conveniently performed 
electronically, for example, using a suitably programmed computer system. 

One aspect of the present invention pertains to a computer system or device, such as a 
computer or linked computers, operatively configured to implement a method of the 
present invention, as described herein. 

One aspect of the present invention pertains to computer code suitable for implementing 
a method of the present invention, as described herein, on a suitable computer system. 

One aspect of the present invention pertains to a computer program comprising 
computer program means adapted to perform a method according to the present 
invention, as described herein, when said program is run on a computer. 

One aspect of the present invention pertains to a computer program, as described 
above, embodied on a computer readable medium. 

One aspect of the present invention pertains to a data carrier which carries computer 
code suitable for implementing a method of the present invention, as described herein, 
on a suitable computer. 

In one embodiment, the above-mentioned computer code or computer program includes, 
or is accompanied by, computer code and/or computer readable data representing a 
predictive mathematical model, as described herein. 

In one embodiment, the above-mentioned computer code or computer program includes, 
or is accompanied by, computer code and/or computer readable data representing data 
from which a predictive mathematical model, as described herein, may be calculated. 

One aspect of the present invention pertains to computer code and/or computer readable 
data representing a predictive mathematical model, as described herein. 

One aspect of the present invention pertains to a data carrier which carries computer 
code and/or computer readable data representing a predictive mathematical model, as 
described herein. 

One aspect of the present invention pertains to a computer system or device, such as a 
computer or linked computers, programmed or loaded with computer code and/or 
computer readable data representing a predictive mathematical model, as described 
herein. 

Computers may be linked, for example, internally (e.g., on the same circuit board, on 
different circuit boards which are part of the same unit), by cabling (e.g., networking, 
ethernet, internet), using wireless technology (e.g., radio, microwave, satellite link, 
cell-phone), etc., or by a combination thereof. 

Examples of data carriers and computer readable media include chip media (e.g., ROM, 
RAM, flash memory (e.g., Memory Stick™, CompactFlash™, SmartMedia™)), magnetic 
disk media (e.g., floppy disks, hard drives), optical disk media (e.g., compact disks 
(CDs), digital versatile disks (DVDs), magneto-optical (MO) disks), and magnetic tape 
media. 

Although the ¹H-NMR spectra analysed here were generated using a conventional (and 
hence large and expensive) 600 MHz NMR spectrometer, on-going technological 
advances suggest that spectrometers of similar resolving power may soon be available 
as desktop units (provided the sample to be analyzed is small, as is the case with 
plasma or serum samples). Such units, together with a personal computer to perform 
automated pattern recognition, may soon be available not only in large hospitals but also 
in the primary healthcare milieu. 

One aspect of the present invention pertains to a system (e.g., an "integrated analyser," 
"diagnostic apparatus") which comprises: 

(a) a first component comprising a device for obtaining NMR spectral intensity 
data for a sample (e.g., an NMR spectrometer, e.g., a Bruker INCA 500 MHz); and, 

(b) a second component comprising a computer system or device, such as a 
computer or linked computers, operatively configured to implement a method of the 
present invention, as described herein, and operatively linked to said first component. 

In one embodiment, the first and second components are in close proximity, e.g., so as 
to form a single console, unit, system, etc. In one embodiment, the first and second 
components are remote (e.g., in separate rooms, in separate buildings). 

A simple process for the use of such a system is described below. 

In a first step, a sample (e.g., blood, urine, etc.) is obtained from a subject, for example, 
by a suitably qualified medical technician, nurse, etc., and the sample is processed as 
required. For example, a blood sample may be drawn, and subsequently processed to 
yield a serum sample, within about three hours. 

In a second step, the sample is appropriately processed (e.g., by dilution, as described 
herein), and an NMR spectrum is obtained for the sample, for example, by a suitably 
qualified NMR technician. Typically, this would require about fifteen minutes. 

In a third step, the NMR spectrum is analysed and/or classified using a method of the 
present invention, as described herein. This may be performed, for example, using a 
computer system or device, such as a computer or linked computers, operatively 
configured to implement the methods described herein. In one embodiment, this step is 
performed at a location remote from the previous step. For example, an NMR 
spectrometer located in a hospital or clinic may be linked, for example, by ethernet, 
internet, or wireless connection, to a remote computer which performs the 
analysis/classification. If appropriate, the result is then forwarded to the appropriate 
destination, e.g., the attending physician. Typically, this would require about fifteen 
minutes. 

Applications 

The methods described herein can be used in the analysis of chemical, biochemical, and 
biological data. 

The methods described herein provide powerful means for the diagnosis and prognosis 
of disease, for assisting medical practitioners in providing optimum therapy for disease, 
and for understanding the benefits and side-effects of xenobiotic compounds, thereby 
aiding the drug development process. 

Furthermore, the methods described herein can be applied in a non-medical setting, 
such as in post mortem examinations, forensic science, and the analysis of complex 
chemical mixtures other than mammalian cells or biofluids. 

Examples of these and other applications of the methods described herein include, but 
are not limited to, the following: 

Medical Diagnostic Applications 

(a) Early detection of abnormality/problem. For example, the technique can be used to 
identify subjects suffering from cerebral edema immediately on arrival in the acute 
emergency department of a hospital. At present, when patients present with head 
trauma, it is difficult to tell whether cerebral edema will be a problem: as a result, it may 
not be possible to intervene until clinical symptoms of cerebral edema become evident, 
which may be too late to save the patient. 

In a similar example, patients arriving at acute emergency departments can be screened 
for internal bleeding and organ rupture, to facilitate early surgical intervention. 

In a third example, the methods described herein can be used to identify a clinically 
silent disease (e.g., low bone mineral density (e.g., osteoporosis); infection with 
Helicobacter pylori) prior to the onset of clinical symptoms (e.g., fracture; development of 
ulcers). 

(b) Diagnosis (identification of disease), especially cheap, rapid, and non-invasive 
diagnosis. For example, the methods described herein can be used to replace treadmill 
exercise tests, echocardiograms, electrocardiograms, and invasive angiography as the 
collective method for the identification of coronary heart disease. Since the current tests 
for coronary heart disease are slow, expensive, and invasive (with associated morbidity 
and mortality), the methods described herein offer significant advantages. 

(c) Differential diagnosis, e.g., classification of disease, severity of disease, etc., for 
example, the ability to distinguish patients with coronary artery disease affecting 1, 2, or 
all 3 coronary arteries (see example below); the ability to distinguish disease at different 
anatomical sites, e.g., in the left coronary artery versus the circumflex artery, or in the 
carotid arteries as opposed to the coronary arteries. 

(d) Population targeting. A condition (e.g., coronary heart disease, osteoporosis) may be 
clinically silent for many years prior to an acute event (e.g., heart attack, bone fracture), 
which may have significant associated morbidity or mortality. Drugs may exist to help 
prevent the acute event (e.g., statins for heart disease, bisphosphonates for 
osteoporosis), but often they cannot be efficiently targeted at the population level. The 
requirements for a test to be useful for population screening are that it must be cheap 
and non-invasive. The methods described herein are ideally suited to population 
screening. Screens for multiple diseases with a single blood sample (e.g., osteoporosis, 
heart disease, and cancer) further improve the cost/benefit ratio for screening. 

(e) Classification, fingerprinting, and diagnosis of metabolic diseases (e.g., inborn errors 
of metabolism). 

(f) Identifying, classifying, determining the progress of, and monitoring the treatment of, 
infectious diseases. 



(g) Characterization and identification of drugs used in overdose. For example, a patient 
may be unconscious following an overdose and/or the nature of the drug taken in 
overdose may not be known. The methods described herein can be used to 
characterise the biological consequences of the overdose and to rapidly identify 
candidate agents, facilitating rapid intervention to reverse the effects. Thus, an overdose 
of opioids could rapidly be countered with naloxone. 

(h) Characterization and identification of poisons, and the metabolic or biological 
consequences of poisoning. Many victims of poisoning (e.g., children) are unaware of 
the nature of the substance they have taken. Furthermore, the subject may be 
unconscious or unable to communicate. The methods described herein can be used to 
characterise the biological consequences of the poisoning and to rapidly identify 
candidate poisons. This would facilitate administration of an appropriate antidote, which 
typically must be done as quickly as possible after exposure to (e.g., ingestion of) the 
toxic substance. 

Medical Prognosis Applications 

(a) Prognosis (prediction of future outcome), including, for example, analysis of "old" 
samples to effect retrospective prognosis. For example, a sample can be used to 
assess the risk of myocardial infarction among sufferers of angina, permitting a more 
aggressive therapeutic strategy to be applied to those at greatest risk of progressing to a 
heart attack. 

(b) Risk assessment, to identify people at risk of suffering from a particular indication. 
The methods described herein can be used for population screening (as for diagnosis) 
but in this case to screen for the risk of developing a particular disease. Such an 
approach will be useful where an effective prophylaxis is known but must be applied prior 
to the development of the disease in order to be effective. For example, 
bisphosphonates are effective at preventing bone loss in osteoporosis but they do not 
increase pathologically low bone mineral density. Ideally, therefore, these drugs are 
applied prior to any bone loss occurring. This can only be done with a technique which 
facilitates prediction of future disease (prognosis). The methods described herein can be 
used to identify those people at high risk of losing bone mineral density in the future, so 
that prophylaxis may begin prior to disease inception. 

(c) Antenatal screening for a wide range of disease susceptibilities. The methods 
described herein can be used to analyse blood or tissue drawn from a pre-term fetus 
(e.g., during chorionic villus sampling or amniocentesis) for the purposes of antenatal 
screening. 

Aids to Therapeutic Intervention 

(a) Therapeutic monitoring, e.g., to monitor the progress of treatment. For example, by 
making serial diagnostic tests, it will be possible to detemnine whether and to what extent 

30 the subject is returning to nomnal following initiation of a therapeutic regimen. 

(b) Patient compliance, e.g., monitoring patient compliance with therapy. Patient 
compfiance is often very poor, particulariy with therapies that have significant side- 
effects. Patients often claim to comply with the therapeutic regimen, but this may not 

35 always be the case. The methods described herein pennit the patient compliance to be 
monitored, both by directly measuring the dmg concentration and also by examining 
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biological consequences of the drug. Thus, the methods descnlied herein offer 
significant advantages over existing methods of monitoring compliance (such as 
measuring plasma concentrations of the dnig) since the patient may take the drog just 
prior to the investigation, while having failed to comply for previous weeks or months. By 
5 monitoring the biological consequences of therapy, it is possible to assess long-temj 
compliance. 

(c) Toxicology, including sophisticated monitoring of any adverse reactions suffered, e.g., 
on a patient-by-patient basis. This will facilitate investigation of idiosyncratic toxicity. 

Some patients may suffer real, clinically significant side-effects from a therapy which
were not seen in the majority. Application of the methods described herein facilitates
rapid identification of these rare, idiosyncratic toxicities so that the therapy can be
discontinued or modified as appropriate. Such an approach allows the therapy to be
tailored to the individual metabolism of each patient.

(d) The methods described herein can be used for "pharmacometabonomics," in analogy
to pharmacogenomics, e.g., subjects could be divided into "responders" and
"nonresponders" using the metabonomic profile as evidence of "response," and features
of the metabonomic profile could then be used to target future patients who would likely
respond to a particular therapeutic course. For example, patients given statins could be
monitored using the methods described herein for beneficial changes in the subtle
composition of the lipoproteins which are associated with coronary heart disease. On
this basis, the patients could be categorised into "statin responsive" or "statin
unresponsive". In a second stage, the methods described herein could be re-applied to
the untreated metabonomic fingerprint to identify pattern elements which predict future
responses to statins. Thus, the clinician would know whether or not patients should be
treated with statins, without having to wait weeks or months to assess the outcome.



Tools for Drug Development

(a) Clinical evaluations of drug therapy and efficacy. As for therapeutic monitoring, the
methods described herein can be used as one end-point in clinical trials for efficacy of
new therapies. The extent to which sequential diagnostic fingerprints move towards
normal can be used as one measure of the efficacy of the candidate therapy.

(b) Detection of toxic side-effects of drugs and model compounds (e.g., in the drug
development process and in clinical trials). For example, it will be possible to identify the
major sites of toxic effects (e.g., liver, kidney, etc.) for new treatments during Phase I
studies, as well as identifying idiosyncratic toxicities during later stage clinical trials.

(c) Improvement in the quality control of transgenic animal models of disease; aiding the 
design of transgenic models of disease. Transgenic models of various diseases have 
been useful for the preclinical development of new therapies. Although the transgenic 
model may recapitulate many of the phenotypic markers of the human disease, it is often
unclear whether similar biochemical mechanisms underlie the resulting phenotype.

(d) Other animal models of disease. For example, injection of bovine type II collagen
into mice has often been used as a model of rheumatoid arthritis, resulting in joint swelling
and autoantibodies, but the mechanisms resulting in the phenotype have little in common
with the human disease. As a result, therapies which are effective in the animal model
may be ineffective in man. The methods described herein can be used to examine the
metabolic and phenotypic consequences of gene manipulation or other interventions
used to yield an animal model of disease, and to compare those with the metabolic and
phenotypic changes characteristic of the disease in man, and thereby validate a range of
animal models of human diseases.

(e) Searching for new biochemical markers of disease and/or tissue or organ damage. 
For example, the NMR bin around δ 3.22 was identified as being particularly associated
with coronary heart disease (see examples below), and the associated species has been
identified as a novel metabolic marker of coronary heart disease which may be
amenable to therapeutic intervention.

Commercial and Other Non-Medical Applications 

(a) Commercial classification for actuarial assessment, to address the commercial need
for insurance companies to assess future risk of disease. Examples include the
provision of health insurance and general life cover. This application is similar to
prognostic assessment and risk assessment in population screening, except that the
purpose is to provide accurate actuarial information.

(b) Clinical trial enrollment, to address the commercial need for the ability to select
individuals suffering from, or at risk of suffering from, a particular condition for enrolment
in clinical trials. For example, at present to perform a clinical trial to assess efficacy of a
drug intended to prevent heart disease it would be necessary to enroll at least 4,000
subjects and follow them for 4 years. If it were possible to select individuals who were
suffering from heart disease, it is estimated that it would be possible to use 400 subjects
followed for 2 years, reducing the cost by 25-fold or more.

(c) Characterization and identification of illicit drugs, and the metabolic or biological
consequences of substance abuse. As for monitoring patient compliance with desired
therapeutics, the methods described herein can be used to examine the metabolic
consequences of illegal substance abuse, permitting confirmation of the use of the
substance, even if none of the substance or its metabolites are present in the system at
the time of investigation. This circumvents the ability to use proscribed substances
chronically, but to temporarily suspend their use to avoid being identified. This
application could be applied to identification of habitual users of illegal drugs (such as
heroin, cocaine, amphetamines, etc.) for police use, or for monitoring use of banned
substances in sports (e.g., to detect use of anabolic steroids among athletes, etc.).

(d) Application to pathology and post-mortem studies. For example, the methods
described herein could be used to identify the proximate cause of death in a subject 
undergoing post-mortem examination. 



(e) Application to forensic science. For example, the methods described herein can be 
used to identify the metabolic consequences of a range of actions on a subject (who may
be either dead or alive at the time of the investigation). For example, the methods
described herein can be applied to identify metabolic consequences of asphyxiation,
poisoning, sexual arousal, or fear. 



(f) Analysis of samples other than mammalian cells or biofluids. For example, the
methods described herein can be applied to a panel of wines, classified by experts for
their quality. By recognising patterns associated with good quality, the methods
described herein can be used by wine manufacturers during the preparation of blends,
as well as by wine purchasers to facilitate a rapid and independent assessment of the
quality of a given wine.

(g) The methods described herein can also be used to identify (known or novel)
genotypes and/or phenotypes, and to determine an organism's phenotype or genotype.
This may assist with the choice of a suitable treatment or facilitate assessment of its
relevance in a drug development process. For example, the generation of metabonomic
data in panels of individuals with disease states, infected states, or undergoing treatment
may indicate response profiles of groups of individuals which can be differentiated into
two or more subgroups, indicating that an allelic genetic basis for response to the
disease, state, or treatment exists. For example, a particular phenotype may not be
susceptible to treatment with a certain drug, while another phenotype may be susceptible
to treatment. Conversely, one phenotype might show toxicity because of a failure to
metabolise and hence excrete a drug, which drug might be safe in another phenotype as
it does not exhibit this effect. For example, metabonomic methods can be used to
determine the acetylator status of an organism: there are two phenotypes, corresponding
to "fast" and "slow" acetylation of drug metabolites. Phenotyping can be achieved on the
basis of the urine alone (i.e., without dosing a xenobiotic), or on the basis of urine
following dosing with a xenobiotic which has the potential for acetylation (e.g.,
galactosamine). Similar methods can also be used to determine other differences, such
as other enzymatic polymorphisms, for example, cytochrome P450 polymorphism.

As shown below, the methods described herein can be used successfully to discriminate
between twins, whether identical twins or non-identical twins. 

The methods described herein may also be used in studies of the biochemical 
consequences of genetic modification, for example, in "knock-out" animals where one or
more genes have been removed or made non-functional; in "knock-in" animals where
one or more genes have been incorporated from the same or a different species; and in
animals where the number of copies of a gene has been increased (as in the model
which results in the over-expression of the beta amyloid protein in mice brains as a
model for Alzheimer's disease). Genes can be transferred between bacterial, plant and
animal species.

The combination of genomic, proteomic, and metabonomic data sets into comprehensive 
"bionomic" systems may permit an holistic evaluation of perturbed in vivo function. 

The methods described herein may be used as an alternative or adjunct to other
methods, e.g., the various genomic, pharmacogenomic, and proteomic methods.

The following examples are provided solely to illustrate the present invention and are
not intended to limit the scope of the present invention, as described herein.

Example 1 

Diagnosis of Coronary Heart Disease (CHD)

As discussed above, the inventors have developed novel methods (which employ
multivariate statistical analysis and pattern recognition (PR) techniques, and optionally
data filtering techniques) of analysing data (e.g., NMR spectra) from a test population
which yield accurate mathematical models which may subsequently be used to classify a
test sample or subject, and/or in diagnosis.

In the context of atherosclerosis/CHD, the inventors have applied these techniques to
the analysis of either serum or plasma taken from individuals who have been extensively
characterized, both for the presence of atherosclerosis/CHD by the gold-standard
angiographic technique and also for a wide range of conventional risk factors.
The metabonomic analysis can distinguish between individuals with and without
atherosclerosis/CHD; and/or the degree of atherosclerosis/CHD. Novel diagnostic
biomarkers for atherosclerosis/CHD have been identified, and methods for associated
diagnosis have been developed.

Obtaining NMR Spectra 

Patients were recruited to the TVD (triple vessel disease) group who had significant
coronary artery disease (defined as a reduction of more than 50% in the intralumenal
diameter) of all three coronary arteries (left anterior descending, circumflex and right
coronary arteries). The symptoms of angina had been stable for at least one month and
no patient had suffered a myocardial infarction in the preceding three months.

Patients were recruited to the NCA (normal coronary artery) group who had chest pain
and a positive exercise electrocardiogram (the Bruce protocol (see, e.g., Bruce, 1974;
Berman et al., 1978; Guyton, 1991) was used, where the presence of at least 1 mm of
horizontal or downward sloping ST segment depression at 80 ms after the J point is
considered positive), but normal coronary angiograms (judged by two independent
observers). NCA patients with hypertension, diabetes mellitus and valvular heart
disease or left ventricular hypertrophy were excluded.

Consecutive patients presenting at Papworth Hospital (Cambridgeshire, UK) who met
the above criteria for either the TVD or NCA group were recruited to the study.
36 patients with severe CHD (TVD patients) and 30 patients with angiographically
normal coronary arteries (NCA patients) were enrolled. The clinical data for these
patient groups are shown in Table 2-CHD, below. For each parameter, the average value
is given together with one standard deviation.



Table 2-CHD

Parameter                          TVD             NCA
Age (years)                        64.1 ± 7.2      57.2 ± 9.0
Sex: Male (n)                      34              7
Sex: Female (n)                    2               23
Myocardial infarction              19              1
Systolic Blood Pressure (mmHg)     138 ± 23        141 ± 22
Diastolic Blood Pressure (mmHg)    75 ± 12         78 ± 12
Smokers (n)                        1               2
Urea (mM)                          5.6 ± 1.6       5.0 ± 1.2
Creatinine (µM)                    108 ± 18        93 ± 14
Glucose (mM)                       5.6 ± 0.9       5.2 ± 0.6
Total cholesterol (mM)             6.2 ± 0.8       5.9 ± 1.1
HDL-cholesterol (mM)               0.8 ± 0.2       1.1 ± 0.2
LDL-cholesterol (mM)               4.5 ± 0.7       4.3 ± 1.1
Total Chol : HDL-Chol ratio        8.3 ± 1.9       5.8 ± 1.8
PAI-1 (ng/dl)                      49.1 ± 16.6     37.9 ± 17.4
Triglycerides (mM)                 2.1 ± 1.1       1.5 ± 1.2
TGF-beta                           1.6 ± 1.4       4.4 ± 4.8
Total protein (g)                  69.4 ± 4.0      70.4 ± 6.3
Albumin (g)                        37.4 ± 2.6      38.6 ± 3.2
% Globulin                         46 ± 4          45 ± 5

Blood was drawn from each patient, allowed to clot in plastic tubes for 2 hours at room
temperature, and the serum was collected by centrifugation. Aliquots of serum were
stored at -80°C until assayed.

Prior to NMR analysis, samples (150 µl) were diluted with solvent solution (10% D2O v/v,
0.9% NaCl w/v) (350 µl). The diluted samples were then placed in 5 mm high quality
NMR tubes (Goss Scientific Instruments Ltd).

Conventional 1-D NMR spectra of the blood serum samples were measured on a
Bruker DRX-600 spectrometer using the conditions set forth in the section entitled "NMR
Experimental Parameters."

NMR Experimental Parameters 

(a) General:

Samples were NON-SPINNING in the spectrometer
Temperature: 300 K
Operating Frequency: 600.22 MHz
Spectral Width: 8389.3 Hz
Number of data points (TD): 32K
Number of scans: 64
Number of dummy scans: 4 (once only, before the start of the acquisition).
Acquisition time: 1.95 s

(b) Pulse Sequence:

noesypr1d (Bruker standard noesypresat sequence, as listed in their manual):
RD - 90° - t1 - 90° - tm - 90° - FID

Relaxation delay (RD): 1.5 s
Fixed interval (t1): 4 µs
Mixing time (tm): 150 ms
90° pulse length: 10.9 µs
Total recycle period: 3.6 s

Secondary irradiation at the water resonance during RD and tm

(c) Phase Cycling

The phase of the RF pulses and the receiver was cycled on successive scans to remove
artefacts according to the following scheme, where PH1 refers to the first 90° pulse, PH2
refers to the second, PH3 refers to the third and PH31 refers to the phase of the
receiver. In the following scheme:

0 denotes 0° phase increment
1 denotes 90° phase increment
2 denotes 180° phase increment
3 denotes 270° phase increment

PH1 = 0 2
PH2 = 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2
PH3 = 0 0 2 2 1 1 3 3
PH31 = 0 2 2 0 1 3 3 1 2 0 0 2 3 1 1 3
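
As a consistency check on the acquisition and pulse-sequence timings listed under (a) and (b) above, the acquisition time and the total recycle period can be recomputed from the other stated values. The following Python sketch is illustrative only. It assumes that "32K" denotes 32,768 points counting real and imaginary points together (so that the acquisition time is TD divided by twice the spectral width) and that pulse durations are negligible; the function names are hypothetical.

```python
# Values copied from the parameter lists (a) and (b) above.
TD = 32 * 1024        # assumed: 32K = 32768 points, real + imaginary
SW_HZ = 8389.3        # spectral width in Hz
SFO_MHZ = 600.22      # operating frequency in MHz
RD_S = 1.5            # relaxation delay in s
T1_S = 4e-6           # fixed interval t1 in s
TM_S = 0.150          # mixing time tm in s


def acquisition_time(td, sw_hz):
    """Acquisition time in s, assuming TD counts real and imaginary points."""
    return td / (2.0 * sw_hz)


def recycle_period(rd, t1, tm, aq):
    """Total time per scan in s, ignoring the microsecond-scale pulse widths."""
    return rd + t1 + tm + aq


aq = acquisition_time(TD, SW_HZ)
total = recycle_period(RD_S, T1_S, TM_S, aq)
sw_ppm = SW_HZ / SFO_MHZ   # spectral width expressed in ppm

print(f"acquisition time = {aq:.3f} s")
print(f"recycle period   = {total:.3f} s")
print(f"spectral width   = {sw_ppm:.2f} ppm")
```

Under these assumptions the recomputed values agree, to the precision quoted, with the stated acquisition time of 1.95 s and total recycle period of 3.6 s.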

(d) Processing of the FIDs: 

This was done using XWINNMR (version 2.1, Bruker GmbH, Germany).
Automatic zero fill x 2 at end of FID.
Line broadening by multiplying the FID by a negative exponential equivalent to a line
broadening of +0.3 Hz.
Fourier transform.
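
The three FID processing steps listed above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the XWINNMR implementation: the FID is taken to be an array of complex points sampled at intervals of 1/(spectral width), a line broadening of LB Hz is applied as the weighting exp(-π·LB·t), and the function name `process_fid` and the synthetic test signal are hypothetical.

```python
import numpy as np


def process_fid(fid, sw_hz, lb_hz=0.3, zero_fill_factor=2):
    """Exponential line broadening, zero filling and Fourier transform of a complex FID."""
    n = fid.size
    dwell = 1.0 / sw_hz                       # assumed spacing of complex points
    t = np.arange(n) * dwell
    apodised = fid * np.exp(-np.pi * lb_hz * t)
    padded = np.zeros(n * zero_fill_factor, dtype=complex)
    padded[:n] = apodised                     # zeros appended at the end of the FID
    spectrum = np.fft.fftshift(np.fft.fft(padded))
    freqs = np.fft.fftshift(np.fft.fftfreq(padded.size, d=dwell))
    return freqs, spectrum


# Synthetic single-line FID (illustrative data only): 1000 Hz offset, 0.5 s decay.
sw = 8389.3
n_points = 16384
t = np.arange(n_points) / sw
fid = np.exp(2j * np.pi * 1000.0 * t) * np.exp(-t / 0.5)

freqs, spec = process_fid(fid, sw)
peak_hz = freqs[np.argmax(np.abs(spec))]
print(spec.size, round(float(peak_hz), 1))
```

Sign and ordering conventions for the frequency axis differ between software packages, so the axis returned here should be treated as illustrative.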

(e) Processing of the NMR spectra: 

This was done using XWINNMR (version 2.1, Bruker GmbH, Germany).

Spectrum peak phase adjusted manually using the zero and first order parameters
PHC0, PHC1.

Baseline corrected manually using the command "basl." This allows the subtraction of
baselines of various degrees of polynomial. The simplest is to subtract a constant to
remove a DC offset and this was sufficient in the present case. In other cases, it can be
necessary to subtract a straight line of adjustable slope or to subtract a baseline defined
by a quadratic function. The possibility exists within the software for functions up to
quartic in nature.

Once properly phased and baseline corrected, the full spectra showed a flat featureless
baseline on both sides of the main set of signals (i.e., outside the range δ 0 to 10), and
the peaks of interest showed a clear in-phase absorption profile.

NMR chemical shifts in the spectra were defined relative to that of the lactate methyl
group (the middle of the doublet, taken to be at δ 1.33).

(f) Reduction of the NMR spectra to descriptors

The NMR spectra in the region δ 10 - δ 0.2 were segmented into 245 regions or
"buckets" of equal length (δ 0.04) using AMIX (Analysis of Mixtures software, version
2.5, Bruker, Germany). The integral of the spectrum in each segment was calculated. In
order to remove the effects of variation in the suppression of the water resonance, and
also the effects of variation in the urea signal caused by partial cross solvent saturation
via solvent exchanging protons, the region δ 6.0 to 4.5 was set to zero integral. The
following AMIX profile was used:



command=bucket_1d_table
input_file=<namesfile>
output_file=<mydata.amix>
left_ppm=10
right_ppm=0.2
exclude1_left_ppm=6.0
exclude1_right_ppm=4.5
exclude2_left_ppm= (intentionally undefined)
exclude2_right_ppm= (intentionally undefined)
bucket_width=0.04
bucket_mode=0
bucket_scale_mode=3
bucket_multiplier=0.01
bucket_output_format=2
normalization_region_left=10
normalization_region_right=0.2



The integral data were normalised to the total spectral area using Excel (Microsoft,
USA). Intensity was integrated over all included regions, and each region was then
divided by the total integral and multiplied by a constant (i.e., 100, so that final integrated
intensities are expressed as percentages of the total intensity).

The normalized data were then exported to the SIMCA-P (version 8.0, Umetrics,
Sweden) software package and each descriptor was mean-centered. All subsequent
analysis was therefore performed on normalised mean-centered data.
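
The reduction of each spectrum to descriptors (bucketing, exclusion of the water/urea region, normalisation to total area, and mean-centering) can be sketched in Python as follows. This is an illustration under stated assumptions, not the AMIX, Excel, or SIMCA-P procedure itself: each bucket integral is approximated by the sum of the data points in that bucket, any bucket overlapping the excluded region is set to zero (the handling of the bucket straddling δ 4.5 is an assumption), and all function names and the random test spectra are hypothetical.

```python
import numpy as np


def bucket_spectrum(ppm, intensity, left=10.0, right=0.2, width=0.04,
                    exclude=(6.0, 4.5)):
    """Sum intensity into equal-width buckets; zero any bucket touching `exclude`."""
    n_buckets = int(round((left - right) / width))    # 245 for the values above
    edges = left - width * np.arange(n_buckets + 1)   # descending ppm edges
    tol = 1e-9                                        # guards against rounding at edges
    buckets = np.zeros(n_buckets)
    for i in range(n_buckets):
        hi, lo = edges[i], edges[i + 1]
        overlaps_excluded = (lo < exclude[0] - tol) and (hi > exclude[1] + tol)
        if not overlaps_excluded:
            buckets[i] = intensity[(ppm <= hi) & (ppm > lo)].sum()
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, buckets


def normalise_to_total(buckets, constant=100.0):
    """Express each bucket as a percentage of the total included intensity."""
    return constant * buckets / buckets.sum()


def mean_centre(X):
    """Subtract the mean of each descriptor (column) across samples."""
    return X - X.mean(axis=0)


# Illustrative random "spectra" only; real input would be processed NMR spectra.
rng = np.random.default_rng(0)
ppm = np.linspace(12.0, -1.0, 20000)
spectra = rng.random((5, ppm.size))

rows = []
for s in spectra:
    centres, b = bucket_spectrum(ppm, s)
    rows.append(normalise_to_total(b))
X = np.vstack(rows)        # samples x 245 descriptors
Xc = mean_centre(X)
print(X.shape)
```

With the stated limits (δ 10 to δ 0.2) and bucket width (δ 0.04), the sketch yields the 245 descriptors per spectrum described above.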

Visual Analysis of Spectra

The 600 MHz NMR spectra of human sera from patients with severe CHD (TVD
patients) and patients with angiographically normal coronary arteries (NCA patients)
were visually compared (see, e.g., Figure 1-CHD). Few systematic differences could be
detected when the two groups were compared.

Chemical components visible in the spectra were assigned on the basis of previously
published data (see, e.g., Nicholson et al., 1995; Lui et al., 1997; Ala-Korpela, 1995).
The features assigned in Figure 1-CHD are summarised in Table 3-CHD, below.



Table 3-CHD

No.   Chemical Shift (δ)   Assignment
1     0.66                 Lipid, HDL; C18 methyl group of HDL-C
2     0.84, 0.87           Lipid, mainly LDL and VLDL; CH3
3     0.97, 1.02           Valine
4     1.25, 1.29           Lipid, mainly LDL and VLDL; (CH2)n
5     1.33                 Lactate
6     1.46                 Alanine
7     1.57                 Lipid; CH2CH2CO
8     1.69                 Lipid; CH2CH2C=C
9     1.97                 Lipid; CH2C=C
10    2.04                 Acetyl signal from α1-acid glycoprotein
11    2.23                 Lipid; CH2CO
12    2.41                 Glutamine
13    2.52, 2.69           Citrate
14    2.69                 Lipid; -C=CCH2C=C
15    2.89                 Albumin lysyl
16    3.05                 Creatinine
17    3.21                 Choline
18    3.24                 H-2 of β-glucose
19    3.3-4.0              CH protons from glycerol, glucose, and amino acid
20    4.11                 Lactate
21    4.64                 H-1 of β-glucose
22    4.7                  Residual water
23    5.23                 H-1 of α-glucose
24    5.26-5.33            Lipids; =CH



Data Analysis 



To determine whether it was possible to distinguish TVD and NCA patients on the basis
of the NMR spectra, principal component analysis (PCA) was performed.

The scores plot of PC2 and PC3 (Figure 2A-CHD) shows that, while there was much
overlap between the two sample classes, some clustering was evident, with NCA samples
dominating in the upper right quadrant and TVD samples dominating in the lower left
quadrant. Optimum separation was seen in PC2 and PC3, and hence t2 vs. t3 is shown
in Figure 2A-CHD.

The corresponding PCA loadings scatter plot (Figure 2B-CHD) shows which regions of
the NMR spectrum are responsible for causing separation between NCA and TVD
samples; the most influential loadings are shown to be: regions δ 1.30; δ 1.22; δ 3.22; δ
0.86; and δ 1.26.
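
The relationship between the scores (t2, t3) and the loadings referred to above can be illustrated with a short Python sketch. This computes PCA by singular value decomposition of the mean-centered descriptor matrix; it is an assumption-laden illustration rather than the SIMCA-P implementation, the function name `pca` is hypothetical, and the data are random numbers standing in for the 66 samples by 245 bucket descriptors.

```python
import numpy as np


def pca(X, n_components=3):
    """PCA via SVD of the mean-centered data; returns scores, loadings, variance fractions."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]    # t1, t2, t3 as columns
    loadings = Vt[:n_components].T                      # p1, p2, p3 as columns
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]


# Random stand-in data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(66, 245))

scores, loadings, explained = pca(X, n_components=3)

# Coordinates for a t2 vs. t3 scores plot.
t2, t3 = scores[:, 1], scores[:, 2]

# Descriptors with the largest combined loading magnitude on PC2 and PC3.
influence = np.hypot(loadings[:, 1], loadings[:, 2])
top5 = np.argsort(influence)[::-1][:5]
print(top5)
```

In such a sketch, the descriptors (buckets) with the largest loading magnitudes on the plotted components are those that contribute most to any separation seen in the corresponding scores plot.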

Following application of OSC, the TVD and NCA groups were well separated in the
scores plot of PC1 and PC2 (Figure 2C-CHD, as compared to Figure 2A-CHD). Here,
NCA samples (circles) dominate in the lower left quadrant; TVD samples (squares)
dominate in the upper right quadrant. Optimum separation was observed in PC1 and
PC2, and hence t1 vs. t2 is shown in Figure 2C-CHD.

The corresponding loadings plot (Figure 2D-CHD) shows which regions of the NMR
spectrum are responsible for causing separation between NCA and TVD samples.
Importantly, the same regions of the spectra that contributed to the clustering in the
unfiltered data set (Figure 2B-CHD) also contributed to the clustering seen after
application of OSC (Figure 2D-CHD): δ 1.30; δ 1.34; δ 1.22; δ 3.22; δ 0.86; and δ 1.26.

Partial least squares discriminant analysis (PLS-DA) performed using the same data,
following application of OSC, yielded excellent separation. In the resulting scores plot of
PC2 and PC1 (see Figure 2E-CHD), NCA samples (circles) dominate the right
hand side; TVD samples (squares) dominate the left hand side. The corresponding
loadings plot (see Figure 2F-CHD) shows which regions of the NMR spectrum are
responsible for causing separation between NCA and TVD samples. Again, the same
regions appear: δ 1.30; δ 1.22; δ 1.26; δ 1.34; δ 3.22; δ 0.86; etc.

A section of the variable importance plot (VIP) for the PLS-DA model calculated from
OSC-filtered NMR data is shown in Figure 3A-CHD.

The regression coefficients for the OSC filtered data are shown graphically in Figure
3B-CHD. For the regression coefficients, a positive value indicates a relatively greater
concentration of a metabolite (e.g., assigned using NMR chemical shift assignment
tables) present in TVD samples and a negative value indicates a relatively lower
concentration, both with respect to control samples.

The regression coefficients for the PLS-DA model (whether obtained using the unfiltered
data or OSC-filtered data) again indicated that the same spectral regions contributed
most strongly to the discrimination of the classes: lipid, mostly VLDL and LDL, and
choline.

The loadings (variables) that are most influential in causing separation between NCA
and TVD samples are summarised in Table 4-CHD, below, and are listed in order of
decreasing importance. The assignments were made by comparing the loadings with
published tables of NMR data.

Table 4-CHD

#    Bucket Region   Assignment                 Chem. Shift (ppm)    NMR spectral intensity,
     (ppm)                                      and Multiplicity     in TVD vs. NCA
1    1.30            lipid (CH2)n               1.29(m)              increased
2    1.22            lipid (CH2)n               1.22(m)              decreased
3    1.26            lipid (CH2)n               1.26(m), 1.25(m)     increased
4    1.34            lipid (CH2)n               1.32(m)              increased
5    3.22            choline N(CH3)3+           3.21(s)              decreased
6    0.86            lipid (CH3)                0.84(t), 0.87(t)     increased
7    0.90            lipid (CH3)                0.91                 increased
8    0.82            lipid (CH3)/cholesterol    0.84                 decreased
9    2.02            lipid (CH2C=C)             2.00(m)              increased
10   1.58            lipid (CH2CH2CO)           1.57(m)              increased
11   2.22            lipid (CH2CO)              2.23(m)              increased
12   1.98            lipid (CH2C=C)             1.97(m)              decreased



The region at δ 3.22 is assigned to -N(CH3)3+ groups in molecules containing the choline
moiety, principally phosphatidylcholine from lipoproteins, mainly HDL, based on the
known phospholipid content of lipoproteins.

The regions at δ 1.30, 1.22, 1.26, and 1.34 all arise from the (CH2)n chains of fatty acyl
groups, which are present in all lipoproteins as phospholipids, cholesteryl esters, and
triacylglycerols. The proportions of all three classes of compounds vary across the
types of lipoprotein. There are two broad NMR peaks in the region δ 1.34-1.22 which
are usually assigned as LDL and VLDL; however, both peaks will contribute to all of
these regions because of the peak line widths.

Lipoproteins account for approximately 10% of total human blood protein. Lipoproteins
are water soluble complexes comprising protein components (e.g., apolipoproteins) and
lipid components (e.g., cholesterol, cholesteryl esters, phospholipids, and triglycerides).
Lipoproteins are often conveniently considered to comprise a hydrophobic core (primarily
of cholesteryl esters and triglycerides) surrounded by a relatively more hydrophilic shell
(primarily apolipoproteins, phospholipids, and unesterified cholesterol) projecting its
hydrophilic domains into the aqueous environment. Lipoproteins presumably serve as
transport proteins for lipids, such as triacylglycerols, cholesterol (and cholesteryl esters),
and other lipids (e.g., phospholipids).

Several classes of lipoproteins (e.g., α, β, broad-β, pre-β) can be distinguished in human
blood, according to their electrophoretic behaviour. However, lipoproteins are more
conveniently characterized by their ultracentrifugation behavior in high-salt media, as
described by their flotation constants (densities), as follows: chylomicra, less than 1.006
g/mL; very low density (VLDL), 1.006-1.019 g/mL; low density (LDL), 1.019-1.063 g/mL;
high density (HDL), 1.063-1.21 g/mL; very high density (VHDL), >1.21 g/mL.

Lipoproteins are often approximately spherical in shape, and range in diameter from
about 0.1 micron (for chylomicra) to about 5 nanometers (for VHDL). Lipoproteins range
in molecular weight from 200 kd to 10,000 kd and from 4 to 95% lipid (the higher the
density the lower the lipid content). Chylomicra and VLDLs are rich in triglycerides
(~90% and ~60% of the total lipid content, respectively), while LDLs are rich in
cholesterol (~60% of total lipid content) and HDLs are rich in phospholipids (~50% of
total lipid content).

Choline (HO-CH2CH2-N(CH3)3+) is incorporated into many biologically important species,
including phosphorylcholine, glycerophosphocholine and phosphatidylcholine (e.g.,
phospholipids). Phospholipids are components of lipid membranes and also of
lipoproteins. The predominant choline-containing species in blood plasma are
phosphatidylcholines.

Validation 

Having established the presence of "clusters" by PCA, the data were analysed by
PLS-DA to test the predictive power of the model.

For cross-validation purposes, training sets comprising approximately 80% of the
samples under study (selected randomly) were constructed. Each training set was used
to construct a PLS-DA model, which was then used to predict the class membership of
the remaining 20% of the samples. Class membership was predicted
using a 0.5 dividing line between the two classes and a class membership probability
value > 0.01 (99% confidence interval).

The PLS-DA model calculated for the OSC-filtered data was then used to predict the
class membership of the samples not included in the training set (Figure 4-CHD). Using
approximately 80% of the NCA (circles) and TVD (squares) samples, a PLS-DA model
was calculated and used to predict the presence of TVD in the remaining 20% of
samples (the validation set) (triangles, NCA or TVD as marked). The y-predicted scatter
plot assigns samples to either class 1 (in this case, corresponding to TVD) or class 0 (in
this case, corresponding to NCA); 0.5 is the cut-off. The PLS-DA model predicted the
presence and absence of TVD with a sensitivity of 92% and a specificity of 93% based
on a 99% confidence limit for class membership.
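
The validation procedure described above (training on approximately 80% of the samples, predicting the remainder with a 0.5 cut-off, and computing sensitivity and specificity) can be sketched in Python as follows. This is an illustration only and makes several assumptions: PLS-DA is approximated by a single-response PLS regression on a 0/1 class label fitted with the NIPALS algorithm, the split is stratified by class (the text states only that selection was random), the class membership probability criterion is not implemented, and the data are synthetic. All function names are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components=2):
    """Single-response PLS (NIPALS); returns centring terms and regression coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)          # weight vector
        t = E @ w                          # scores
        tt = t @ t
        p = E.T @ t / tt                   # X loadings
        c = (f @ t) / tt                   # y loading
        E = E - np.outer(t, p)             # deflate X
        f = f - c * t                      # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef


def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean


def stratified_split(y, train_fraction, rng):
    """Randomly pick `train_fraction` of each class for training (an assumption)."""
    train = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        train.extend(idx[: int(train_fraction * idx.size)])
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(y.size), train)
    return train, test


def sensitivity_specificity(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic stand-in data: 36 class-1 ("TVD") and 30 class-0 ("NCA") samples.
rng = np.random.default_rng(0)
y = np.array([1] * 36 + [0] * 30, dtype=float)
X = rng.normal(size=(y.size, 245))
X[y == 1, :5] += 5.0                       # artificial class difference

train, test = stratified_split(y, 0.8, rng)
model = pls1_fit(X[train], y[train])
y_class = (pls1_predict(model, X[test]) > 0.5).astype(int)   # 0.5 cut-off
sens, spec = sensitivity_specificity(y[test].astype(int), y_class)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Here sensitivity is the fraction of class 1 (TVD) validation samples predicted as class 1, and specificity is the fraction of class 0 (NCA) validation samples predicted as class 0.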

This demonstrates that 1H-NMR based metabonomic analysis of plasma samples, in
itself minimally invasive and non-destructive of sample, can achieve clinically useful
diagnostic performance, when compared to invasive angiography.

This example demonstrates that it is possible to completely separate CHD patients with
stenosis of all three major arteries from subjects with normal coronary arteries using
principal component analysis (PCA).

Furthermore, using the supervised PLS-DA algorithm, it is possible to predict the artery
status of unknown samples using a training set that comprised only 24 NCA and 30 TVD
individuals. The small size of the training set required to achieve >90% sensitivity and
specificity highlights the power of this technique. Substantially larger training sets
obtained through application of this technique to clinical practice should further improve
the diagnostic sensitivity and specificity of the technique.

25 

While the peaks around δ 1.30 are known to result predominantly from lipid CH2
resonances, the values of the NMR descriptors in this region only correlate weakly with
the level of LDL-cholesterol (r² = 0.20). This means that there is considerable NMR
signal intensity information in these windows which is uncorrelated with the level of
LDL-cholesterol. This arises from the presence of some small molecule metabolites
such as lactate and threonine and also contributions from other lipoproteins (mainly
VLDL) present in the biofluid. The line widths of the LDL and VLDL CH2 peaks are such
that the two peaks overlap considerably and both will contribute to all of the windows in
this region to varying amounts. The remaining variance is likely to result from subtle
chemical differences in the lipid composition of LDL particles between individuals, for
example, degree of fatty acid side chain unsaturation and lipoprotein-protein molecular
interactions. Such observations will contribute to on-going studies using both NMR and
other analytical techniques to understand the contribution of lipoprotein particle
composition to the development of CHD. It does, however, emphasize an important
facet of high data density metabolic analysis in that it is entirely unnecessary to
understand fully the complex molecular differences that underlie the spectral features
associated with CHD to be able to correctly classify individuals with very high sensitivity
and specificity. Further analysis of the molecular basis of the spectral differences,
however, will give insight into the mechanistic processes involved.
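
The statement that an r² of 0.20 leaves considerable signal intensity information uncorrelated with LDL-cholesterol may be illustrated by the following minimal sketch. The function names are hypothetical and are not part of the methods described herein; only the value r² = 0.20 is taken from the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def unexplained_fraction(r):
    """Fraction of the variance of one variable not accounted for by the other."""
    return 1.0 - r * r


# Value reported in the text for the descriptors near delta 1.30 versus LDL-cholesterol.
r_squared_reported = 0.20
fraction_unexplained = 1.0 - r_squared_reported
```

On this reading, about four fifths of the variance in the descriptors is not accounted for by LDL-cholesterol level in a linear sense.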

Example 2

Determination of Severity of Coronary Heart Disease (CHD)

As discussed above, the inventors have developed novel methods (which employ
multivariate statistical analysis and pattern recognition (PR) techniques, and optionally
data filtering techniques) of analysing data (e.g., NMR spectra) from a test population
which yield accurate mathematical models which may subsequently be used to classify a
test sample or subject and/or in diagnosis.

In the context of atherosclerosis/CHD, the inventors have applied these techniques to
the analysis of either serum or plasma taken from individuals who have been extensively
characterized, both for the presence of atherosclerosis/CHD by the gold-standard
angiographic technique and also for a wide range of conventional risk factors.
The metabonomic analysis can distinguish between individuals with and without
atherosclerosis/CHD; and/or the degree of atherosclerosis/CHD. Novel diagnostic
biomarkers for atherosclerosis/CHD have been identified, and methods for associated
diagnosis have been developed.

Obtaining NMR Spectra - Severity of CHD

To determine whether ¹H NMR based metabonomic analysis could distinguish the
severity of CHD present, samples were collected from individuals with stenosis of one,
two or three major coronary arteries. Although this is a crude indicator of disease
severity, it is plausible that the number of vessels stenosed correlated (at least weakly)
with whole body atherosclerotic plaque load.

Using plasma from 76 patients (28 with 1 vessel stenosed: type "1" vessel disease; 20
with 2 vessels stenosed: type "2" vessel disease; 28 with 3 vessels stenosed: type "3"
vessel disease), NMR spectral analysis was used to classify the severity of CHD.
The methods for collection of samples; NMR spectroscopy; data processing; and pattern
recognition methods were all as described above, unless specified otherwise.

Patients were recruited according to the same criteria as described above, except that
patients with more than 50% stenosis of either one, two or all three coronary arteries
(assessed by two independent observers) were recruited and females were excluded.
The clinical data that were measured (conventionally) for these patient groups are shown
in Table 5-CHD, below. For each parameter, the average value is given together with
one standard deviation.



Table 5-CHD

#   Parameter                Type "1"        Type "2"        Type "3"
1   Number (n) (all male)    28              20              28
2   Height (m)               1.76 ± 0.07     1.80 ± 0.05     1.78 ± 0.06
3   Weight (kg)              83.5 ± 14.7     91.1 ± 10.0     86.7 ± 9.6
4   BMI (kg/m²)              26.77 ± 4.01    28.07 ± 3.55    27.32 ± 2.22
5   Erythrocytes             4.64 ± 0.35     4.54 ± 0.55     4.66 ± 0.25
6   Haemoglobin (g/dL)       13.9 ± 0.82     13.53 ± 1.52    13.54 ± 0.95
7   Hematocrit               0.418 ± 0.026   0.410 ± 0.053   0.409 ± 0.025
8   MCV (fl)                 90.2 ± 4.3      90.2 ± 4.3      87.7 ± 5.3
9   MCHC (g/dL)              30.1 ± 1.6      29.8 ± 1.5      29.1 ± 2.0
10  Platelets (10⁹/L)        210 ± 45        210 ± 27        214 ± 57
11  Leucocytes               6.30 ± 1.21     6.74 ± 1.74     6.22 ± 1.50
12  Neutrophils (10⁹/L)      3.63 ± 0.89     4.09 ± 1.77     3.61 ± 1.14
13  Lymphocytes (10⁹/L)      1.88 ± 0.52     1.84 ± 0.55     1.79 ± 0.44
14  Monocytes (10⁹/L)        0.53 ± 0.14     0.51 ± 0.17     0.53 ± 0.14
15  Eosinophils (10⁹/L)      0.21 ± 0.12     0.19 ± 0.12     0.16 ± 0.10
16  Basophils (10⁹/L)        0.02 ± 0.01     0.02 ± 0.01     0.02 ± 0.01
17  LUC                      0.08 ± 0.03     0.08 ± 0.04     0.09 ± 0.05
18  Fibrinogen               3.52 ± 0.86     3.76 ± 1.01     3.57 ± 0.84
19  PT test (s)              13.6 ± 0.9      13.6 ± 1.2      13.7 ± 0.8
20  APTT test                29.0 ± 2.9      30.1 ± 4.0      30.2 ± 3.1
21  Sodium (mmol/L)          140 ± 2         139 ± 2         140 ± 2
22  Potassium (mmol/L)       4.1 ± 0.3       4.1 ± 0.2       4.2 ± 0.3
23  Urea (mmol/L)            6.1 ± 1.7       6.6 ± 1.4       6.1 ± 1.3
24  Creatinine (µmol/L)      104 ± 10        103 ± 10        107 ± 11
25  Protein (g/L)            72 ± 4          72 ± 6          72 ± 3
26  Albumin (g/L)            42 ± 3          41 ± 4          42 ± 3
27  Immunoglobulins (g/L)    31 ± 4          30 ± 5          30 ± 3
28  Bilirubin (µmol/L)       9 ± 4           11 ± 4          10 ± 4
29  ALT (U/L)                19 ± 6          23 ± 10         22 ± 8
30  ALP (U/L)                183 ± 41        178 ± 39        173 ± 41
31  γGT (U/L)                12.1 ± 7.0      14.0 ± 10.3     12.9 ± 7.5
32  Glucose (mmol/L)         5.8 ± 1.3       5.9 ± 1.4       6.1 ± 2.3
33  HbA1c                    5.6 ± 0.5       5.9 ± 1.3       6.3 ± 0.6
34  Cholesterol (mmol/L)     5.3 ± 0.9       5.6 ± 1.4       5.2 ± 0.9
35  LDL-C (mmol/L)           3.3 ± 0.8       3.6 ± 1.3       3.2 ± 0.9
36  HDL-C (mmol/L)           1.01 ± 0.23     0.97 ± 0.17     1.04 ± 0.34
37  Triglycerides (mmol/L)   2.0 ± 1.1       2.2 ± 1.0       2.1 ± 0.8

Blood samples from these patients were drawn into Diatube H tubes, and platelet-poor
plasma was prepared as previously described. Aliquots of plasma were stored at -80°C
until assayed.

Samples were obtained, and 1-D NMR spectra were collected using the same 
methods and parameters as described in the NCA/TVD section. 

Data Analysis 

A principal components analysis (PCA) model was calculated using 1-D NMR spectra
for serum samples from patients with either 1, 2, or 3 vessels stenosed (i.e., type "1",
type "2", and type "3" vessel disease, respectively).

The scores scatter plot for the PCA model is shown in Figure 5A-CHD. Whilst there is
much overlap between the three classes of sample, some separation is evident,
particularly for the type "1" vessel disease samples, which dominate the lower left of the
plot. Optimum separation was observed in PC2 and PC1, hence t2 vs. t1 is plotted in
the figure.
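
By way of illustration only, the scores t1 and t2 that are plotted against one another in such a scores scatter plot may be computed as in the following minimal sketch, which uses the NIPALS iteration on mean-centred data. The function names and the toy intensity values are hypothetical; they are not the data or the software (SIMCA) used in this example.

```python
def center_columns(X):
    """Subtract the column mean from every column of X (a list of rows)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]


def nipals_pca(X, n_components=2, n_iter=500, tol=1e-20):
    """Return (scores, loadings), one list per component, for the data matrix X."""
    E = center_columns(X)
    n, m = len(E), len(E[0])
    scores, loadings = [], []
    for _component in range(n_components):
        # Start from the column carrying the largest sum of squares.
        j0 = max(range(m), key=lambda j: sum(row[j] ** 2 for row in E))
        t = [row[j0] for row in E]
        p = [0.0] * m
        for _step in range(n_iter):
            tt = sum(v * v for v in t)
            # Loading: regress every column of E on the current score vector.
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = sum(v * v for v in p) ** 0.5
            p = [v / norm for v in p]
            # Score: project every sample onto the unit-length loading.
            t_new = [sum(E[i][j] * p[j] for j in range(m)) for i in range(n)]
            diff = sum((a - b) ** 2 for a, b in zip(t, t_new))
            t = t_new
            if diff < tol:
                break
        # Deflate: remove the fitted component before extracting the next one.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        scores.append(t)
        loadings.append(p)
    return scores, loadings


# Hypothetical intensities: 6 samples (rows) by 3 spectral windows (columns).
toy_spectra = [
    [2.0, 1.0, 0.5],
    [3.0, 1.5, 0.2],
    [4.0, 2.2, 0.9],
    [5.0, 2.4, 0.1],
    [6.0, 3.1, 0.7],
    [7.0, 3.4, 0.3],
]
scores, loadings = nipals_pca(toy_spectra, n_components=2)
t1, t2 = scores  # plotting t2 against t1 gives the scores scatter plot
```

Each loading vector indicates how strongly each spectral window contributes to the corresponding component, which is the information shown in a loadings plot.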

The corresponding loadings plot is shown in Figure 5B-CHD, which shows which regions
of the NMR spectrum are responsible for causing separation between the three different
degrees of severity of CHD. Due to the extent of overlap, the loadings plot is difficult to
interpret; however, the most influential loadings are regions: 3.22; 1.38; 1.34; 1.30; 1.26;
1.22; 0.90; 0.86; and 0.82 ppm.

Improved separation is possible using PLS-DA (rather than the unsupervised PCA). Due
to the fact that the pattern recognition software package (SIMCA) displays data only in
2 dimensions, and in this example there are three sample classes, it is necessary to plot
two classes at a time when calculating, e.g., PLS-DA models. A scores plot and the
corresponding loadings for each pair ("1" and "2"; "1" and "3"; "2" and "3") are shown in
Figure 5C-CHD. There remains much overlap between the classes; however, some
separation is evident.

Another PCA model was calculated using the same data. However, prior to PCA, the
NMR data were filtered by application of OSC, which serves to remove variation that is
not correlated to class and therefore improves subsequent multivariate analysis.
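
The idea of removing variation that is not correlated to class may be illustrated by the following simplified, single-component sketch. It is an assumption-laden illustration and not the OSC procedure used in this example: it takes the first principal component score, makes it orthogonal to the class variable, and subtracts the corresponding component from the data, omitting the iterative weight estimation of the full algorithm. All names and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def center_columns(X):
    cols = [center(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*cols)]


def first_pc_scores(E, n_iter=500):
    """Scores of the first principal component of centred E (power iteration)."""
    n, m = len(E), len(E[0])
    p = [1.0 / m ** 0.5] * m  # arbitrary unit-length starting direction
    for _step in range(n_iter):
        t = [dot(row, p) for row in E]
        p = [dot([E[i][j] for i in range(n)], t) for j in range(m)]
        norm = dot(p, p) ** 0.5
        p = [v / norm for v in p]
    return [dot(row, p) for row in E]


def osc_one_component(X, y):
    """Remove one class-orthogonal component; returns (filtered data, removed score)."""
    E = center_columns(X)
    yc = center(y)
    t = first_pc_scores(E)
    # Strip from the score whatever part of it is correlated with the class variable.
    coef = dot(yc, t) / dot(yc, yc)
    t_orth = [ti - coef * yi for ti, yi in zip(t, yc)]
    tt = dot(t_orth, t_orth)
    n, m = len(E), len(E[0])
    p = [dot([E[i][j] for i in range(n)], t_orth) / tt for j in range(m)]
    filtered = [[E[i][j] - t_orth[i] * p[j] for j in range(m)] for i in range(n)]
    return filtered, t_orth


# Hypothetical data: window 0 tracks class, window 1 carries large unrelated variation.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 1.0, 0.1],
    [0.9, 3.0, 0.3],
    [2.0, 4.8, 0.2],
    [2.1, 1.2, 0.4],
    [1.9, 3.1, 0.1],
]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
filtered, removed_score = osc_one_component(X, y)
```

Because the removed score is orthogonal to the class variable, the covariance between each spectral window and class is unchanged by the filtering, while the total variation in the data is reduced.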

The scores scatter plot for the resulting PCA model is shown in Figure 6A-CHD. The
improved separation between the classes of different severity of CHD is evident, with
type "1" vessel disease dominating in the lower left quadrant.

The corresponding loadings scatter plot is shown in Figure 6B-CHD, which shows which
regions of the NMR spectrum are responsible for distinguishing severity of CHD.
Importantly, it is the same regions as for distinguishing NCA from TVD that are depicted
in Figure 5B-CHD, namely: 3.22; 1.38; 1.34; 1.30; 1.26; 1.22; 0.90; 0.86; and 0.82 ppm.

Again, improved separation is possible using PLS-DA (rather than the unsupervised
PCA). A scores plot and the corresponding loadings for each pair ("1" and "2"; "1" and
"3"; "2" and "3") are shown in Figure 6C-CHD. Most separation is observed between types
"1" and "2" (Figure 6C-(1)-CHD) and types "1" and "3" (Figure 6C-(5)-CHD). This
suggests that the metabolic profile (NMR spectrum) for type "1" vessel disease differs
the most compared to the profiles for type "2" and type "3", which are more similar to
each other.

Pairs of variable importance plots (VIPs) and regression coefficient plots for each of the
three PLS-DA models described in Figure 6C-(1)-CHD through (6)-CHD are shown in
Figure 7-(1)-CHD through (6)-CHD.

The regression coefficients in the loadings plots indicated that spectral windows ca. δ
1.30 and δ 1.26, dominated by lipid resonances, contributed to most of the separation
between the severity classes, with the window at δ 3.22 (choline) being relatively less
important than in the comparison of TVD and NCA patients.

Validation 

Y-predicted scatter plots for the OSC-PLS-DA models are shown in Figure 8A-CHD,
Figure 8B-CHD, and Figure 8C-CHD, and these demonstrate the ability of NMR
based metabonomics to predict class membership (severity of CHD; 1, 2 or 3 vessels
affected) of unknown samples. For each plot, about 80% of the total number of samples
were used to calculate a PLS-DA model which was then used to predict the severity in
the remaining 20% of the samples. The y-predicted scatter plots assign samples to
either class 1 or class 0; and the cut-off is 0.5.
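
The validation scheme described above (build a model on about 80% of the samples, predict the remainder, and assign class by a cut-off of 0.5 on the predicted y value) may be sketched as follows. This is an illustrative one-component PLS regression on a 0/1 class variable with hypothetical names and toy data; the models in this example were two-component models calculated with other software, and the split here is fixed rather than random.

```python
def fit_pls1_one_component(X, y):
    """Fit a single-component PLS model of a 0/1 class variable y on X."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    f = [v - y_mean for v in y]
    # Weight vector: direction in X of maximum covariance with y.
    w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]
    # Inner regression coefficient of y on the score.
    q = sum(ti * fi for ti, fi in zip(t, f)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, w, q


def predict_y(model, x):
    """Predicted y value for one new sample x."""
    x_mean, y_mean, w, q = model
    t = sum((xj - mj) * wj for xj, mj, wj in zip(x, x_mean, w))
    return y_mean + q * t


def classify(y_pred, cutoff=0.5):
    """Assign class 1 above the cut-off, class 0 otherwise."""
    return 1 if y_pred > cutoff else 0


# Hypothetical two-window descriptors: 8 training samples (80%), 2 held out (20%).
train_X = [
    [1.0, 2.0], [1.2, 1.9], [0.8, 2.2], [1.1, 2.1],  # class 0
    [3.0, 0.9], [3.2, 1.1], [2.9, 1.0], [3.1, 0.8],  # class 1
]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_X = [[0.9, 1.8], [2.8, 1.2]]
test_y = [0, 1]

model = fit_pls1_one_component(train_X, train_y)
predicted = [classify(predict_y(model, x)) for x in test_X]
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
```

The fraction of held-out samples assigned to the correct class corresponds to the prediction accuracy quoted for each pairwise model.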

The type "1" and type "2" vessel disease PLS-DA model (Figure 8A-CHD) predicted the 
severity accurately in 88% of cases. Furthermore, for a two-component model, severity 
was predicted with a significance level ^0% using a 99% confidence limit.

The type "2" and type "3" vessel disease PLS-DA model (Figure 8B-CHD) predicted the 
severity accurately in 88% of cases. Furthermore, for a two-component model, severity 
was predicted with a significance level ≥85% using a 99% confidence limit.

The type "1" and type "3" vessel disease PLS-DA model (Figure 8C-CHD) predicted the
severity accurately in 75% of cases. Furthermore, for a two-component model, severity
was predicted with a significance level ≥92% using a 99% confidence limit.

This metabonomic analysis can distinguish individuals with different severity of CHD.
Even using the crude parameter of number of major coronary vessels with >50%
stenosis, this example demonstrates that both PCA and PLS-DA are capable of
categorizing CHD patients on the basis of severity. The failure to achieve complete
separation of the classes is as likely to reflect the crude nature of the severity
designations based solely on coronary angiography as on any lack of power in the
metabonomic analysis to discriminate individuals.

Example 3 (Comparison Example) 
Use of Established Clinical Risk Factors 

In this example, multivariate data analysis was used to classify the severity of CHD on
the basis of established clinical parameters.

This allows direct comparison of the performance of the metabonomic analysis as a
diagnostic technique with algorithms based on conventional risk factors.

A PCA model was calculated using established clinical parameters measured for
patients with 1, 2 or 3 vessels stenosed. The scores scatter plot for PC1 and PC2 is
shown in Figure 9A-CHD. The PCA model shows there is much overlap between the
samples, and no separation is evident; compare this with Figure 5A-CHD and Figure
6A-CHD. There is no evidence of separation in the PCA scores plot, suggesting that
clinical parameters do not distinguish between "1", "2", or "3" vessel disease.

The corresponding loadings plot is shown in Figure 9B-CHD, and shows which of the
established clinical parameters are responsible for causing separation between the three
different degrees of severity of CHD. Due to the extent of overlap, the loadings plot is
difficult to interpret.

Improved separation is possible using PLS-DA (rather than the unsupervised PCA). Due
to the fact that the pattern recognition package (SIMCA) displays data only in
2 dimensions, and in this example there are three sample classes, it is necessary to plot
two classes at a time when calculating, e.g., PLS-DA models. A scores plot and the
corresponding loadings for each pair is shown in Figure 9C-CHD. As can be seen from
the figures, the separation based on established clinical parameters is not as evident as
it was based on NMR data.

Pairs of variable importance plots (VIPs) and regression coefficient plots for each of the
three PLS-DA models described in Figure 9C-(1)-CHD through (6)-CHD are shown in
Figure 10-(1)-CHD through (6)-CHD.

None of the risk factors measured (including age, blood pressure, LDL and HDL
cholesterol, total cholesterol, total triglyceride, fibrinogen, PAM, white blood cell count,
creatinine or history of cigarette smoking) were significantly different between the three
groups (p>0.05 by ANOVA in each case).
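
For illustration, the one-way ANOVA comparison of a single risk factor across the three severity groups may be sketched as below. The function name and the toy values are hypothetical; the sketch returns only the F statistic and its degrees of freedom, and the p-value would be read from the F distribution (for example, `scipy.stats.f_oneway` returns both the statistic and the p-value).

```python
def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means about the grand mean.
    ss_between = sum(
        len(g) * (mean - grand_mean) ** 2 for g, mean in zip(groups, group_means)
    )
    # Variation of individual values about their own group mean.
    ss_within = sum(
        sum((v - mean) ** 2 for v in g) for g, mean in zip(groups, group_means)
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical values of one risk factor in three groups of three subjects each.
f_stat, df_between, df_within = one_way_anova_f(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
)
```

A small F statistic relative to the critical value for the given degrees of freedom corresponds to p>0.05, i.e., no significant difference between the group means.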

This demonstrates that the ¹H-NMR based metabonomic methods described above are
substantially better able to distinguish the severity of CHD based on a single blood
sample than any of the conventional risk factors yet identified.

No other conventional risk factors measured in these subjects (including age, blood
pressure, lipoprotein levels or clotting parameters) differed between the severity classes,
even in a cross-sectional analysis, and hence were completely unable to distinguish
individuals within the population on the basis of CHD severity. This demonstrates the
extent to which metabonomics improves upon conventional risk factor analysis.

***

The foregoing has described the principles, preferred embodiments, and modes of
operation of the present invention. However, the invention should not be construed as
limited to the particular embodiments discussed. Instead, the above-described
embodiments should be regarded as illustrative rather than restrictive, and it should be
appreciated that variations may be made in those embodiments by workers skilled in the
art without departing from the scope of the present invention as defined by the appended
claims.
CLAIMS

1. A method of classifying a sample, said method comprising the step of relating
NMR spectral intensity at one or more predetermined diagnostic spectral
windows for said sample with a predetermined condition associated with
atherosclerosis/coronary heart disease.

2. A method, according to claim 1, of classifying a sample from a subject, said
method comprising the step of relating NMR spectral intensity at one or more
predetermined diagnostic spectral windows for said sample with a predetermined
condition associated with atherosclerosis/coronary heart disease of said subject.

3. A method, according to claim 1, of classifying a sample, said method comprising
the step of relating NMR spectral intensity at one or more predetermined
diagnostic spectral windows for said sample with the presence or absence of a
predetermined condition associated with atherosclerosis/coronary heart disease.

4. A method, according to claim 1, of classifying a sample from a subject, said
method comprising the step of relating NMR spectral intensity at one or more
predetermined diagnostic spectral windows for said sample with the presence or
absence of a predetermined condition associated with atherosclerosis/coronary
heart disease of said subject.

5. A method, according to claim 1, of classifying a sample, said method comprising
the step of relating a modulation of NMR spectral intensity, relative to a control
value, at one or more predetermined diagnostic spectral windows for said sample
with a predetermined condition associated with atherosclerosis/coronary heart
disease.

6. A method, according to claim 1, of classifying a sample from a subject, said
method comprising the step of relating a modulation of NMR spectral intensity,
relative to a control value, at one or more predetermined diagnostic spectral
windows for said sample with a predetermined condition associated with
atherosclerosis/coronary heart disease of said subject.

7. A method, according to claim 1, of classifying a sample, said method comprising
the step of relating a modulation of NMR spectral intensity, relative to a control
value, at one or more predetermined diagnostic spectral windows for said sample
with the presence or absence of a predetermined condition associated with
atherosclerosis/coronary heart disease.

8. A method, according to claim 1, of classifying a sample from a subject, said
method comprising the step of relating a modulation of NMR spectral intensity,
relative to a control value, at one or more predetermined diagnostic spectral
windows for said sample with the presence or absence of a predetermined
condition associated with atherosclerosis/coronary heart disease of said subject.

9. A method of classifying a subject, said method comprising the step of relating
NMR spectral intensity at one or more predetermined diagnostic spectral
windows for a sample from said subject with a predetermined condition
associated with atherosclerosis/coronary heart disease of said subject.

10. A method, according to claim 9, of classifying a subject, said method comprising
the step of relating NMR spectral intensity at one or more predetermined
diagnostic spectral windows for a sample from said subject with the presence or
absence of a predetermined condition associated with atherosclerosis/coronary
heart disease of said subject.

11. A method, according to claim 9, of classifying a subject, said method comprising
the step of relating a modulation of NMR spectral intensity, relative to a control
value, at one or more predetermined diagnostic spectral windows for a sample
from said subject with a predetermined condition associated with
atherosclerosis/coronary heart disease of said subject.

12. A method, according to claim 9, of classifying a subject, said method comprising
the step of relating a modulation of NMR spectral intensity, relative to a control
value, at one or more predetermined diagnostic spectral windows for a sample
from said subject with the presence or absence of a predetermined condition
associated with atherosclerosis/coronary heart disease of said subject.
* * *

13. A method of diagnosing a predetermined condition associated with
atherosclerosis/coronary heart disease of a subject, said method comprising the
step of relating NMR spectral intensity at one or more predetermined diagnostic
spectral windows for a sample from said subject with said predetermined
condition of said subject.

14. A method, according to claim 13, of diagnosing a predetermined condition
associated with atherosclerosis/coronary heart disease of a subject, said method
comprising the step of relating NMR spectral intensity at one or more
predetermined diagnostic spectral windows for a sample from said subject with
the presence or absence of said predetermined condition of said subject.

15. A method, according to claim 13, of diagnosing a predetermined condition
associated with atherosclerosis/coronary heart disease of a subject, said method
comprising the step of relating a modulation of NMR spectral intensity, relative to
a control value, at one or more predetermined diagnostic spectral windows for a
sample from said subject with said predetermined condition of said subject.

16. A method, according to claim 13, of diagnosing a predetermined condition
associated with atherosclerosis/coronary heart disease of a subject, said method
comprising the step of relating a modulation of NMR spectral intensity, relative to
a control value, at one or more predetermined diagnostic spectral windows for a
sample from said subject with the presence or absence of said predetermined
condition of said subject.

17. A method of classifying a sample, said method comprising the step of relating the
amount of, or relative amount of one or more diagnostic species present in said
sample with a predetermined condition associated with atherosclerosis/coronary
heart disease.
18. A method, according to claim 17, of classifying a sample from a subject, said
method comprising the step of relating the amount of, or relative amount of one
or more diagnostic species present in said sample with a predetermined condition
associated with atherosclerosis/coronary heart disease of said subject.

19. A method, according to claim 17, of classifying a sample, said method comprising
the step of relating the amount of, or relative amount of one or more diagnostic
species present in said sample with the presence or absence of a predetermined
condition associated with atherosclerosis/coronary heart disease.

20. A method, according to claim 17, of classifying a sample from a subject, said
method comprising the step of relating the amount of, or the relative amount of,
one or more diagnostic species present in said sample with the presence or
absence of a predetermined condition associated with atherosclerosis/coronary
heart disease of said subject.

21. A method, according to claim 17, of classifying a sample, said method comprising
the step of relating a modulation of the amount of, or relative amount of one or
more diagnostic species present in said sample, as compared to a control
sample, with a predetermined condition associated with atherosclerosis/coronary
heart disease.

22. A method, according to claim 17, of classifying a sample from a subject, said
method comprising the step of relating a modulation of the amount of, or relative
amount of one or more diagnostic species present in said sample, as compared
to a control sample, with a predetermined condition associated with
atherosclerosis/coronary heart disease of said subject.

23. A method, according to claim 17, of classifying a sample, said method comprising
the step of relating a modulation of the amount of, or relative amount of one or
more diagnostic species present in said sample, as compared to a control
sample, with the presence or absence of a predetermined condition associated
with atherosclerosis/coronary heart disease.

24. A method, according to claim 17, of classifying a sample from a subject, said
method comprising the step of relating a modulation of the amount of, or relative
amount of one or more diagnostic species present in said sample, as compared
to a control sample, with the presence or absence of a predetermined condition
associated with atherosclerosis/coronary heart disease of said subject.



25. A method of classifying a subject, said method comprising the step of relating the
amount of, or relative amount of one or more diagnostic species present in a
sample from said subject with a predetermined condition associated with
atherosclerosis/coronary heart disease of said subject.

26. A method, according to claim 25, of classifying a subject, said method comprising
the step of relating the amount of, or relative amount of one or more diagnostic
species present in a sample from said subject with the presence or absence of a
predetermined condition associated with atherosclerosis/coronary heart disease
of said subject.

27. A method, according to claim 25, of classifying a subject, said method comprising
the step of relating a modulation of the amount of, or relative amount of one or
more diagnostic species present in a sample from said subject, as compared to a
control sample, with a predetermined condition associated with
atherosclerosis/coronary heart disease of said subject.

28. A method, according to claim 25, of classifying a subject, said method comprising
the step of relating a modulation of the amount of, or relative amount of one or
more diagnostic species present in a sample from said subject, as compared to a
control sample, with the presence or absence of a predetermined condition
associated with atherosclerosis/coronary heart disease of said subject.

***

29. A method of diagnosing a predetermined condition associated with
atherosclerosis/coronary heart disease of a subject, said method comprising the
step of relating the amount of, or relative amount of one or more diagnostic
species present in a sample from said subject with said predetermined condition
of said subject.
30. A method, according to claim 29, of diagnosing a predetermined condition
associated with atherosclerosis/coronary heart disease of a subject, said method
comprising the step of relating the amount of, or relative amount of one or more
diagnostic species present in a sample from said subject with the presence or
absence of said predetermined condition of said subject.

31. A method, according to claim 29, of diagnosing a predetermined condition
associated with atherosclerosis/coronary heart disease of a subject, said method
comprising the step of relating a modulation of the amount of, or relative amount
of one or more diagnostic species present in a sample from said subject, as
compared to a control sample, with said predetermined condition of said subject.

32. A method, according to claim 29, of diagnosing a predetermined condition
associated with atherosclerosis/coronary heart disease of a subject, said method
comprising the step of relating a modulation of the amount of, or relative amount
of one or more diagnostic species present in a sample from said subject, as
compared to a control sample, with the presence or absence of said
predetermined condition of said subject.



* * * 



33. A method of classification, said method comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling
method to modelling data;

(b) using said model to classify a test sample.

34. A method, according to claim 33, of classifying a test sample, said method
comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling
method to modelling data;

wherein said modelling data comprises a plurality of data sets for
modelling samples of known class;

(b) using said model to classify said test sample as being a member of
one of said known classes.

35. A method, according to claim 33, of classifying a test sample, said method
comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling
method to modelling data;

wherein said modelling data comprises at least one data set for each of a
plurality of modelling samples;

wherein said modelling samples define a class group consisting of a
plurality of classes;

wherein each of said modelling samples is of a known class selected from
said class group; and,

(b) using said model with a data set for said test sample to classify said
test sample as being a member of one class selected from said class group.

36. A method of classification, said method comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to
modelling data;

to classify a test sample.

37. A method, according to claim 36, of classifying a test sample, said method
comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to
modelling data;

wherein said modelling data comprises a plurality of data sets for
modelling samples of known class;

to classify said test sample as being a member of one of said known
classes.

38. A method, according to claim 36, of classifying a test sample, said method
comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to
modelling data;

wherein said modelling data comprises at least one data set for each of a
plurality of modelling samples;

wherein said modelling samples define a class group consisting of a
plurality of classes;

wherein each of said modelling samples is of a known class selected from
said class group;

with a data set for said test sample to classify said test sample as being a
member of one class selected from said class group.



* * *

39. A method of classification, said method comprising the steps of:

(a) forming a predictive mathematical model by applying a modelling method to
modelling data;

(b) using said model to classify a subject.

40. A method, according to claim 39, of classifying a subject, said method comprising
the steps of:

(a) forming a predictive mathematical model by applying a modelling
method to modelling data;

wherein said modelling data comprises a plurality of data sets for
modelling samples of known class;

(b) using said model to classify a test sample from said subject as being a
member of one of said known classes, and thereby classify said subject.

41. A method, according to claim 39, of classifying a subject, said method comprising
the steps of:

(a) forming a predictive mathematical model by applying a modelling
method to modelling data;

wherein said modelling data comprises at least one data set for each of a
plurality of modelling samples;

wherein said modelling samples define a class group consisting of a
plurality of classes;

wherein each of said modelling samples is of a known class selected from
said class group; and,

(b) using said model with a data set for a test sample from said subject to
classify said test sample as being a member of one class selected from said
class group, and thereby classify said subject.
42. A method of classification, said method comprising the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to
modelling data;

to classify a subject.

43. A method, according to claim 42, of classifying a subject, said method comprising
the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to
modelling data;

wherein said modelling data comprises a plurality of data sets for
modelling samples of known class;

to classify a test sample from said subject as being a member of one of
said known classes, and thereby classify said subject.

44. A method, according to claim 42, of classifying a subject, said method comprising
the step of:

using a predictive mathematical model;

wherein said model is formed by applying a modelling method to
modelling data;

wherein said modelling data comprises at least one data set for each of a
plurality of modelling samples;

wherein said modelling samples define a class group consisting of a
plurality of classes;

wherein each of said modelling samples is of a known class selected from
said class group;

with a data set for a test sample from said subject to classify said test
sample as being a member of one class selected from said class group, and
thereby classify said subject.



35 45. A method of diagnosis, said method comprising the steps of: 
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(a) forming a predictive mathematical model by applying a modelling 
method to modelling data; 

(b) using said model to diagnose a subject. 

46. A method, according to claim 45, of diagnosing a predetermined condition 
associated with atherosclerosis/coronary heart disease of a subject, said method 
comprising the steps of: 

(a) forming a predictive mathematical model by applying a modelling 
method to modelling data; 

wherein said modelling data comprises a plurality of data sets for 
modelling samples of known class; 

(b) using said model to classify a test sample from said subject as being a 
member of one of said known classes, and thereby diagnose said subject. 

47. A method, according to claim 45, of diagnosing a predetermined condition 
associated with atherosclerosis/coronary heart disease of a subject, said method 
comprising the steps of: 

(a) forming a predictive mathematical model by applying a modelling 
method to modelling data; 

wherein said modelling data comprises at least one data set for each of a 
plurality of modelling samples; 

wherein said modelling samples define a class group consisting of a 
plurality of classes; 

wherein each of said modelling samples is of a known class selected from 
said class group; and, 

(b) using said model with a data set for a test sample from said subject to 
classify said test sample as being a member of one class selected from said 
class group, and thereby diagnose said subject. 

48. A method of diagnosis, said method comprising the step of: 

using a predictive mathematical model; 

wherein said model is formed by applying a modelling method to 
modelling data; 

to diagnose a subject. 

49. A method, according to claim 48, of diagnosing a predetermined condition 
associated with atherosclerosis/coronary heart disease of a subject, said method 
comprising the step of: 

using a predictive mathematical model; 

wherein said model is formed by applying a modelling method to 
modelling data; 

wherein said modelling data comprises a plurality of data sets for 
modelling samples of known class; 

to classify a test sample from said subject as being a member of one of 
said known classes, and thereby diagnose said subject. 



50. A method, according to claim 48, of diagnosing a predetermined condition 
associated with atherosclerosis/coronary heart disease of a subject, said method 
comprising the step of: 

using a predictive mathematical model; 

wherein said model is formed by applying a modelling method to 
modelling data; 

wherein said modelling data comprises at least one data set for each of a 
plurality of modelling samples; 

wherein said modelling samples define a class group consisting of a 
plurality of classes; 

wherein each of said modelling samples is of a known class selected from 
said class group; 

with a data set for a test sample from said subject to classify said test 
sample as being a member of one class selected from said class group, and 
thereby diagnose said subject. 



* * * 



51. A method according to any one of claims 1 to 50, wherein said test sample is a 
test sample from a subject, and said predetermined condition is a predetermined 
condition of said subject. 



* * * 



52. A method according to any one of claims 1 to 50, wherein said "a modulation of" 
is "an increase or decrease in." 

53. A method according to any one of claims 1 to 52, wherein said relating step 
involves the use of a predictive mathematical model. 

54. A method according to any one of claims 1 to 52, wherein said modelling method 
is a multivariate statistical analysis modelling method. 

55. A method according to any one of claims 1 to 52, wherein said modelling method 
is a multivariate statistical analysis modelling method which employs a pattern 
recognition method. 

56. A method according to any one of claims 1 to 52, wherein said modelling method 
is, or employs, PCA. 

57. A method according to any one of claims 1 to 52, wherein said modelling method 
is, or employs, PLS. 

58. A method according to any one of claims 1 to 52, wherein said modelling method 
is, or employs, PLS-DA. 
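
As a minimal sketch of one of the named techniques, principal components analysis (PCA) of a small data matrix can be illustrated as below. This is not the implementation used in the specification: it assumes mean-centring only (no scaling), extracts only the first component, and uses power iteration on the covariance matrix for brevity. The function name `first_principal_component` and the data values are hypothetical.

```python
def first_principal_component(rows, iterations=200):
    """Return (loading vector, scores) for the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the mean-centred variables.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration; assumes the start vector is not orthogonal to the
    # leading eigenvector and that the data are not all identical.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Scores are the projections of each centred row onto the loading vector.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores

# Hypothetical data: four samples, two variables that vary together.
loadings, scores = first_principal_component(
    [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
```

The scores give each sample's position along the direction of greatest variance, which is the quantity typically plotted when looking for class separation.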

59. A method according to any one of claims 1 to 58, wherein said modelling method 
includes a step of data filtering. 

60. A method according to any one of claims 1 to 58, wherein said modelling method 
includes a step of orthogonal data filtering. 

61. A method according to any one of claims 1 to 58, wherein said modelling method 
includes a step of OSC. 



62. A method according to any one of claims 1 to 61, wherein said model takes 
account of one or more diagnostic species. 

63. A method according to any one of claims 1 to 62, wherein said modelling data 
comprise spectral data. 



64. A method according to any one of claims 1 to 62, wherein said modelling data 
comprise both spectral data and non-spectral data. 

65. A method according to any one of claims 1 to 62, wherein said modelling data 
comprise NMR spectral data. 

66. A method according to any one of claims 1 to 62, wherein said modelling data 
comprise both NMR spectral data and non-NMR spectral data. 

67. A method according to any one of claims 1 to 62, wherein said NMR spectral 
data comprises ¹H NMR spectral data and/or ¹³C NMR spectral data. 

68. A method according to any one of claims 1 to 62, wherein said NMR spectral 
data comprises ¹H NMR spectral data. 

69. A method according to any one of claims 1 to 62, wherein said modelling data 
comprise spectra. 

70. A method according to any one of claims 1 to 62, wherein said modelling data are 
spectra. 

71. A method according to any one of claims 1 to 70, wherein said modelling data 
comprises a plurality of data sets for modelling samples of known class. 

72. A method according to any one of claims 1 to 70, wherein said modelling data 
comprises at least one data set for each of a plurality of modelling samples. 

73. A method according to any one of claims 1 to 70, wherein said modelling data 
comprises exactly one data set for each of a plurality of modelling samples. 

74. A method according to any one of claims 1 to 70, wherein said using step is: 

using said model with a data set for said test sample to classify said test sample 
as being a member of one class selected from said class group. 

75. A method according to any one of claims 1 to 74, wherein each of said data sets 
comprises spectral data. 

76. A method according to any one of claims 1 to 74, wherein each of said data sets 
comprises both spectral data and non-spectral data. 

77. A method according to any one of claims 1 to 74, wherein each of said data sets 
comprises NMR spectral data. 

78. A method according to any one of claims 1 to 74, wherein each of said data sets 
comprises both NMR spectral data and non-NMR spectral data. 

79. A method according to any one of claims 1 to 74, wherein said NMR spectral 
data comprises ¹H NMR spectral data and/or ¹³C NMR spectral data. 

80. A method according to any one of claims 1 to 74, wherein said NMR spectral 
data comprises ¹H NMR spectral data. 

81. A method according to any one of claims 1 to 74, wherein each of said data sets 
comprises a spectrum. 

82. A method according to any one of claims 1 to 74, wherein each of said data sets 
comprises a ¹H NMR spectrum and/or ¹³C NMR spectrum. 

83. A method according to any one of claims 1 to 74, wherein each of said data sets 
comprises a ¹H NMR spectrum. 

84. A method according to any one of claims 1 to 74, wherein each of said data sets 
is a spectrum. 

85. A method according to any one of claims 1 to 74, wherein each of said data sets 
is a ¹H NMR spectrum and/or ¹³C NMR spectrum. 






86. A method according to any one of claims 1 to 74, wherein each of said data sets 
is a ¹H NMR spectrum. 

87. A method according to any one of claims 1 to 86, wherein said non-spectral data 
is non-spectral clinical data. 

88. A method according to any one of claims 1 to 86, wherein said non-NMR spectral 
data is non-spectral clinical data. 

89. A method according to any one of claims 1 to 88, wherein said class group 
comprises classes associated with said predetermined condition. 

90. A method according to any one of claims 1 to 88, wherein said class group 
comprises exactly two classes. 

91. A method according to any one of claims 1 to 88, wherein said class group 
comprises exactly two classes: presence of said predetermined condition; and 
absence of said predetermined condition. 



* * * 



92. A method according to any one of claims 1 to 91, wherein said sample is an in 
vivo sample. 

93. A method according to any one of claims 1 to 91, wherein said sample is an ex 
vivo sample. 

94. A method according to any one of claims 1 to 91, wherein said sample is a blood 
sample or a blood-derived sample. 



95. A method according to any one of claims 1 to 91, wherein said sample is a blood 
sample. 

96. A method according to any one of claims 1 to 91, wherein said sample is a blood 
plasma sample. 

97. A method according to any one of claims 1 to 91, wherein said sample is a blood 
serum sample. 



98. A method according to any one of claims 1 to 97, wherein said subject is an 
animal. 

99. A method according to any one of claims 1 to 97, wherein said subject is a 
mammal. 

100. A method according to any one of claims 1 to 97, wherein said subject is a 
human. 

101. A method according to any one of claims 1 to 100, wherein said one or more 
predetermined diagnostic spectral windows is: a single predetermined diagnostic 
spectral window. 

102. A method according to any one of claims 1 to 100, wherein said one or more 
predetermined diagnostic spectral windows is: a plurality of predetermined 
diagnostic spectral windows. 

103. A method according to any one of claims 1 to 100, wherein: 

said one or more predetermined diagnostic spectral windows is: a plurality 
of diagnostic spectral windows; and, 

said NMR spectral intensity at one or more predetermined diagnostic 
spectral windows is: a combination of a plurality of NMR spectral intensities, 
each of which is NMR spectral intensity for one of said plurality of predetermined 
diagnostic spectral windows. 

104. A method according to claim 103, wherein said combination is a linear 
combination. 
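
For illustration, a linear combination of NMR spectral intensities over a plurality of diagnostic spectral windows, as recited in claims 103 and 104, might be computed as sketched below. The window boundaries, weights, and spectrum values are hypothetical and are not taken from Table 4-CHD. Summing the points that fall within a window is assumed here as one simple way of obtaining the intensity for that window.

```python
def window_intensity(shifts, intensities, low, high):
    """Sum spectral intensity over points whose chemical shift lies in [low, high)."""
    return sum(y for x, y in zip(shifts, intensities) if low <= x < high)

def combined_intensity(shifts, intensities, windows, weights):
    """Linear combination of the per-window intensities."""
    return sum(w * window_intensity(shifts, intensities, lo, hi)
               for (lo, hi), w in zip(windows, weights))

# Hypothetical spectrum: chemical shift (ppm) and intensity at each point.
shifts = [0.82, 0.86, 1.22, 1.26, 3.22]
intensities = [2.0, 3.0, 4.0, 1.0, 5.0]

windows = [(0.80, 0.90), (1.20, 1.30)]  # hypothetical diagnostic windows
weights = [0.5, -1.0]                   # hypothetical coefficients

value = combined_intensity(shifts, intensities, windows, weights)
```

Each window contributes a single summed intensity, and the weights determine how an increase or decrease in each window moves the combined value.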

105. A method according to any one of claims 1 to 104, wherein said one or more 
predetermined diagnostic spectral windows are associated with one or more 
diagnostic species. 

106. A method according to any one of claims 1 to 104, wherein at least one of said 
one or more predetermined diagnostic spectral windows encompasses a 
chemical shift value for an NMR resonance of a diagnostic species. 

107. A method according to any one of claims 1 to 104, wherein each of a plurality of said one 
or more predetermined diagnostic spectral windows encompasses a chemical 
shift value for an NMR resonance of a diagnostic species. 

108. A method according to any one of claims 1 to 104, wherein each of said one or more 
predetermined diagnostic spectral windows encompasses a chemical shift value 
for an NMR resonance of a diagnostic species. 

109. A method according to any one of claims 106 to 108, wherein said NMR 
resonance is a ¹H NMR resonance. 

110. A method according to any one of claims 1 to 109, wherein said one or more 
diagnostic species are endogenous diagnostic species. 

111. A method according to any one of claims 1 to 109, wherein said one or more 
diagnostic species are associated with NMR spectral intensity at predetermined 
diagnostic spectral windows. 

112. A method according to any one of claims 1 to 111, wherein said one or more diagnostic 
species are a plurality of diagnostic species. 

113. A method according to any one of claims 1 to 111, wherein said one or more diagnostic 
species is a single diagnostic species. 

114. A method according to any one of claims 1 to 113, wherein said classification is 
performed on the basis of an amount, or a relative amount, of a single diagnostic 
species. 

115. A method according to any one of claims 1 to 113, wherein said classification is 
performed on the basis of an amount, or a relative amount, of a plurality of 
diagnostic species. 

116. A method according to any one of claims 1 to 113, wherein said classification is 
performed on the basis of an amount, or a relative amount, of each of a plurality 
of diagnostic species. 

117. A method according to any one of claims 1 to 113, wherein said classification is 
performed on the basis of a total amount, or a relative total amount, of a plurality 
of diagnostic species. 

118. A method according to any one of claims 1 to 113, wherein: 

said one or more diagnostic species is: a plurality of diagnostic species; 

and, 

said amount of, or relative amount of, one or more diagnostic species is: a 
combination of a plurality of amounts, or relative amounts, each of which is the 
amount of, or relative amount of, one of said plurality of diagnostic species. 

119. A method according to claim 118, wherein said combination is a linear 
combination. 



120. A method according to any one of claims 1 to 119, wherein said predetermined 
diagnostic spectral windows are defined by one or more index values, δr, 
corresponding to the bucket regions listed in Table 4-CHD. 

121. A method according to any one of claims 1 to 119, wherein at least one of said 
one or more predetermined diagnostic species is a species described in Table 4-CHD. 

122. A method according to any one of claims 1 to 119, wherein each of a plurality of 
said one or more predetermined diagnostic species is a species described in 
Table 4-CHD. 

123. A method according to any one of claims 1 to 119, wherein each of said one or 
more predetermined diagnostic species is a species described in Table 4-CHD. 



124. A method of identifying a diagnostic species, or a combination of a plurality of 
diagnostic species, for a predetermined condition associated with 
atherosclerosis/coronary heart disease, said method comprising the steps of: 

(a) applying a multivariate statistical analysis method to experimental 
data; 

wherein said experimental data comprises at least one data set comprising 
experimental parameters measured for each of a plurality of experimental 
samples; 

wherein said experimental samples define a class group consisting of a 
plurality of classes; 

wherein at least one of said plurality of classes is a class associated with 
said predetermined condition, e.g., a class associated with the presence of said 
predetermined condition; 

wherein at least one of said plurality of classes is a class not associated 
with said predetermined condition, e.g., a class associated with the absence of 
said predetermined condition; 

wherein each of said experimental samples is of known class selected 
from said class group; 

and: 

(b) identifying one or more critical experimental parameters; 

wherein each of said critical experimental parameters is statistically 
significantly different for classes of said class group, e.g., is statistically significant 
for discriminating between classes of said class group; and, 

(c) matching each of one or more of said one or more critical experimental 
parameters with said diagnostic species; 

or 

(b) identifying a combination of a plurality of critical experimental 
parameters; 

wherein said combination of a plurality of critical experimental parameters 
is statistically significantly different for classes of said class group, e.g., is 
statistically significant for discriminating between classes of said class group; 
and, 

(c) matching each of one or more of said plurality of critical experimental 
parameters with said combination of a plurality of diagnostic species. 
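
As a hedged illustration of step (b) of claim 124, the sketch below ranks experimental parameters by how strongly they differ between two classes. It substitutes a simple univariate Welch t statistic for the multivariate statistical analysis method recited in the claim, and it computes only the statistic: a significance threshold, and any correction for testing many parameters, would still have to be chosen. The function names `welch_t` and `rank_parameters` and all data values are hypothetical.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent groups of measurements."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

def rank_parameters(samples, classes, class_a, class_b):
    """Return (parameter index, t) pairs sorted by decreasing |t|."""
    stats = []
    for j in range(len(samples[0])):
        a = [s[j] for s, c in zip(samples, classes) if c == class_a]
        b = [s[j] for s, c in zip(samples, classes) if c == class_b]
        stats.append((j, welch_t(a, b)))
    return sorted(stats, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical experimental data: three parameters per sample, known class.
samples = [[1.0, 5.0, 2.0], [1.2, 5.2, 2.1], [0.9, 4.9, 1.8],
           [1.1, 2.0, 2.0], [1.0, 2.2, 1.9], [1.1, 1.9, 2.2]]
classes = ["present"] * 3 + ["absent"] * 3

ranking = rank_parameters(samples, classes, "present", "absent")
```

The parameters at the head of the ranking are the candidates for the matching of step (c), for example by matching a spectral parameter to a spectral peak and then to a species.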

125. A method, according to claim 124, wherein: 

one or more of said critical experimental parameters is a spectral 
parameter; and 

said identifying and matching steps are: 

(b) identifying one or more critical experimental spectral parameters; and, 

(c) matching each of one or more of said one or more critical experimental 
spectral parameters with a spectral feature, e.g., a spectral peak; 

and matching one or more of said spectral peaks with said diagnostic 
species; 

or 

(b) identifying a combination of a plurality of critical experimental spectral 
parameters; and, 

(c) matching each of a plurality of said plurality of critical experimental 
spectral parameters with a spectral feature, e.g., a spectral peak; 

and matching one or more of said spectral peaks with said combination of 
a plurality of diagnostic species. 

126. A method according to any one of claims 124 to 125, wherein said multivariate 
statistical analysis method is a multivariate statistical analysis method which 
employs a pattern recognition method. 

127. A method according to any one of claims 124 to 126, wherein said multivariate 
statistical analysis method is, or employs, PCA. 

128. A method according to any one of claims 124 to 126, wherein said multivariate 
statistical analysis method is, or employs, PLS. 

129. A method according to any one of claims 124 to 126, wherein said multivariate 
statistical analysis method is, or employs, PLS-DA. 

130. A method according to any one of claims 124 to 129, wherein said multivariate 
statistical analysis method includes a step of data filtering. 

131. A method according to any one of claims 124 to 129, wherein said multivariate 
statistical analysis method includes a step of orthogonal data filtering. 

132. A method according to any one of claims 124 to 129, wherein said multivariate 
statistical analysis method includes a step of OSC. 

133. A method according to any one of claims 124 to 132, wherein said experimental 
parameters comprise spectral data. 

134. A method according to any one of claims 124 to 132, wherein said experimental 
parameters comprise both spectral data and non-spectral data. 

135. A method according to any one of claims 124 to 132, wherein said experimental 
parameters comprise NMR spectral data. 

136. A method according to any one of claims 124 to 132, wherein said experimental 
parameters comprise both NMR spectral data and non-NMR spectral data. 

137. A method according to any one of claims 124 to 136, wherein said NMR spectral 
data comprises ¹H NMR spectral data and/or ¹³C NMR spectral data. 

138. A method according to any one of claims 124 to 136, wherein said NMR spectral 
data comprises ¹H NMR spectral data. 

139. A method according to any one of claims 124 to 138, wherein said non-spectral 
data is non-spectral clinical data. 

140. A method according to any one of claims 124 to 138, wherein said non-NMR 
spectral data is non-spectral clinical data. 

141. A method according to any one of claims 124 to 140, wherein said critical 
experimental parameters are spectral parameters. 

142. A method according to any one of claims 124 to 141, wherein said class group 
comprises classes associated with said predetermined condition. 

143. A method according to any one of claims 124 to 142, wherein said class group 
comprises exactly two classes. 

144. A method according to any one of claims 124 to 142, wherein said class group 
comprises exactly two classes: presence of said predetermined condition; and 
absence of said predetermined condition. 

145. A method according to any one of claims 124 to 142, wherein said class 
associated with said predetermined condition is a class associated with the 
presence of said predetermined condition. 

146. A method according to any one of claims 124 to 142, wherein said class not 
associated with said predetermined condition is a class associated with the 
absence of said predetermined condition. 

147. A method according to any one of claims 124 to 146, said method further 
comprising the additional step of: 

(d) confirming the identity of said diagnostic species. 

148. A computer system or device, such as a computer or linked computers, 
operatively configured to implement a method according to any one of claims 1 to 
147. 



149. Computer code suitable for implementing a method according to any one of 
claims 1 to 147 on a suitable computer system. 

150. A computer program comprising computer program means adapted to perform a 
method according to any one of claims 1 to 147, when said program 
is run on a computer. 

151. A computer program according to claim 150, embodied on a computer readable 
medium. 

152. A data carrier which carries computer code suitable for implementing a method 
according to any one of claims 1 to 147 on a suitable computer. 

153. Computer code and/or computer readable data representing a predictive 
mathematical model as described in any one of claims 1 to 147. 

154. A data carrier which carries computer code and/or computer readable data 
representing a predictive mathematical model as described in any one of claims 1 
to 147. 

155. A computer system or device, such as a computer or linked computers, 
programmed or loaded with computer code and/or computer readable data 
representing a predictive mathematical model as described in any one of claims 1 
to 147. 



156. A system comprising: 

(a) a first component comprising a device for obtaining NMR spectral 
intensity data for a sample; and, 

(b) a second component comprising a computer system or device, such as 
a computer or linked computers, operatively configured to implement a method 
according to any one of claims 1 to 147, and operatively linked to said first 
component. 

* * * 

157. A diagnostic species identified by a method according to any one of claims 124 to 
147. 



158. A diagnostic species identified by a method according to any one of claims 124 to 
147 for use in a method of classification. 

159. A method of classification which employs or relies upon one or more diagnostic 
species identified by a method according to any one of claims 124 to 147. 

160. Use of one or more diagnostic species identified by a method of classification 
according to any one of claims 124 to 147. 

161. An assay for use in a method of classification, which assay relies upon one or 
more diagnostic species identified by a method according to any one of claims 
124 to 147. 



162. Use of an assay in a method of classification, which assay relies upon one or 
more diagnostic species identified by a method according to any one of claims 
124 to 147. 



* * * 



163. A method of therapeutic monitoring of a subject undergoing therapy which 
employs a method of classification according to any one of claims 1 to 123. 

164. A method of evaluating drug therapy and/or drug efficacy which employs a 
method of classification according to any one of claims 1 to 123. 